Disclosure of Invention
The invention provides an article generating method based on multi-dimensional data, which aims to overcome the defects described in the background art.
In order to achieve the above object, the present invention provides the following technical solutions: the article generating method based on the multi-dimensional data comprises the following steps:
step one, preprocessing text data and establishing a data index base, wherein the data index base is established by constructing an associated semantic chain network, mapping texts, reconstructing semantic communities and acquiring the text recognition rate;
step two, carrying out data retrieval by inputting keywords and searching the data index base, so as to form a keyword-related text set;
step three, generating multiple texts from the text set obtained by the keywords according to a network graph structure, and ranking the texts;
and step four, the operator screens the candidate text sentences according to the sentence ordering of the candidate texts, so as to accurately locate the candidate text sentences.
In a preferred embodiment, the flow for constructing the associated semantic chain network is as follows:
selecting the word set from the text preprocessing stage as the semantic node set, and taking the co-occurrence relation between words as the association rule;
carrying out community division on the associated semantic chain network through the weights of the semantic chains, so as to form single semantic communities each describing the same event or topic; the weight of the semantic chain between word a and word b is calculated from n_ab, the number of texts in which word a and word b occur together, n_a, the number of texts in which word a occurs, and n_b, the number of texts in which word b occurs, where α and β are the weight coefficients of word a and word b.
In a preferred embodiment, the text mapping is performed as follows:
acquiring the mutual information between a text and a semantic community to measure their degree of similarity; the smaller the mutual information, the lower the relevance between the text and the semantic community, and the lower the likelihood that the text is similar to the topic described by the semantic community; the larger the mutual information, the higher the relevance, and the higher the likelihood that the text is similar to the topic described by the semantic community; the mutual information is calculated as follows:
where p(d) is the probability that document d is selected; since texts are independent of one another during the mapping of texts to communities, p(d) is set to 1; p(c_i) is the probability that community c_i is selected; and p(d, c_i) is the joint probability of document d and event semantic community c_i, which is defined by the degree of association between document d and community c_i; this degree of association is calculated by a similarity formula between document d and community c_i.
In a preferred embodiment, the process of reconstructing the semantic community is as follows:
the edge weights of the semantic community are adjusted through a formula that combines the semantic chain weight between nodes a and b after the text mapping is finished, the semantic chain weight between nodes a and b within the community when the community division ends, and the semantic chain weight between nodes a and b in the original associated semantic chain network;
the text recognition rate is obtained by the following steps:
number of partitioned semantic communities
And text quantity->
The closer the ratio is to 1, the closer the result of semantic community division is to the actual situation;
in a preferred embodiment, the keyword relevance text set is generated in the following manner:
and inputting keywords, performing answering segmentation on the input keywords, quickly searching a text set matched with the keywords from a data index library according to the keywords, sequencing the text set, and returning to a list display page.
In a preferred embodiment, the ranking procedure for the candidate text sentences is as follows:
the text sentence scores are obtained, and the texts are sorted according to the text sentence scores; the sorting flow is as follows:
an undirected graph is constructed through the LexRank algorithm, with sentences as nodes and the similarity between nodes as edges; each sentence node is scored for importance according to its degree and the weights of its edges, and the top-scoring sentences are finally selected as the text candidate sentences;
the text candidate sentence scores are calculated as follows:
the text candidate sentences are represented with a vector space model; because the text candidate sentences are generated from short texts, their word sequences can be obtained directly; with words as the smallest linguistic units in the VSM, the set of text sentences can be represented as:
S = {s_1, s_2, ..., s_n}, where s_k is one text sentence in the text sentence set S, whose vector representation is s_k = (w_k1, w_k2, ..., w_km), where w_ki is the TF-ISF value of the i-th term feature item in the k-th text sentence; TF-ISF is calculated as TF-ISF(w, s_k) = tf(w, s_k) · log(N / n), where tf(w, s_k) is the frequency of word w in the text sentence, N is the number of text sentences in the text sentence set, and n is the number of text sentences containing word w;
the similarity of each text sentence in the text sentence set is calculated through cosine similarity and is used as the weight among all nodes in the language network diagram, and the calculation mode is as follows:
the constructed undirected graph G is represented by an n × n matrix A, where n is the total number of sentences in the text sentence set and the diagonal elements A_ii are set to 0, i.e., the self-connections of sentence nodes are ignored;
the semantic representations of the text candidate sentence and of the event are calculated, and the relevance of the text sentence to the event is calculated through a formula in which e is the event semantic community obtained by the community discovery algorithm;
the modified saliency of a sentence node is expressed by the following formula, in which S refers to the whole text sentence set and d is an adjustable parameter in the interval [0, 1]; the larger the value of d, the more weight the relevance between the text sentence and the event semantics is given in the saliency score; with this definition, LexRank can be converted into a matrix form in which A and B are both square matrices, the A matrix represents the similarity between sentence nodes, the B matrix represents the similarity between sentence nodes and the event semantics, and the resulting score vector gives the final scores of the text sentence set.
In a preferred embodiment, the redundancy processing manner of the candidate text sentence is as follows:
two sets are defined: a result set, which is initially empty, and a candidate set, in which each element is a text sentence together with its score from the saliency calculation in the previous step;
the elements of the candidate set are sorted in descending order of score;
the text sentence with the highest score is moved from the candidate set to the result set, and the scores of all text sentences remaining in the candidate set are recalculated according to a formula;
this is repeated until the candidate set is empty or the text reaches a preset condition limit;
finally, sentence ordering is carried out on the candidate texts to form a candidate text sentence recommendation ranking sequence:
when two candidate text sentences both contain time information, they are arranged in chronological order;
when two candidate text sentences both come from the same source text, they are ranked according to their order of appearance in that source text;
when two candidate text sentences neither contain time information nor originate from the same source text, they are sorted by the relevance of the text sentences to the event semantics, with the more relevant candidate text sentence placed before the less relevant one.
In a preferred embodiment, the method for accurately locating the candidate text sentence is as follows:
the availability of the candidate text sentences is checked, and candidate text sentences whose references deviate are removed to form a new candidate text sentence recommendation ranking sequence; at the same time, the features of the removed candidate text sentences are recorded, the operator is asked for the reason for removal and this information is collected, and text sentences matching the removal reason are avoided when keywords are subsequently input for data retrieval.
In the technical scheme, the invention has the technical effects and advantages that:
1. According to the method and device of the invention, related texts can be searched for content and the text sentences scored according to the input keywords, and the text sentences are then ranked according to their similarity to the keywords, so that semantics can be discovered spontaneously from the text set, multiple documents are generated, and articles are produced according to the user's requirements; the user can quickly review the data, the retrieval time is saved, and the data acquisition efficiency is improved.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In embodiment 1, referring to fig. 1, the article generating method based on multi-dimensional data in this embodiment includes the following steps:
step one, preprocessing text data and establishing a data index base, wherein the data index base is established by constructing an associated semantic chain network, mapping texts, reconstructing semantic communities and acquiring the text recognition rate;
the flow for constructing the associated semantic chain network comprises the following steps:
selecting the word set from the text preprocessing stage as the semantic node set, and taking the co-occurrence relation between words as the association rule (namely, two semantic nodes that occur in the same text are regarded as associated, and the degree of association between semantic nodes is measured through the weight of the semantic chain);
carrying out community division on the associated semantic chain network through the weights of the semantic chains, so as to form single semantic communities each describing the same event or topic; the weight of the semantic chain between word a and word b is calculated from n_ab, the number of texts in which word a and word b occur together, n_a, the number of texts in which word a occurs, and n_b, the number of texts in which word b occurs, where α and β are the weight coefficients of word a and word b;
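The functional form of the semantic chain weight is not reproduced above, so the short Python sketch below should be read as one plausible instantiation only: it counts document frequencies and co-occurrences and combines them with the coefficients α and β; the normalisation used is an assumption, not the claimed formula.

    from collections import Counter
    from itertools import combinations

    def semantic_chain_weights(texts, alpha=0.5, beta=0.5):
        # Assumed weighting: w(a, b) = alpha * n_ab / n_a + beta * n_ab / n_b (illustrative only).
        doc_freq = Counter()   # n_a: number of texts containing word a
        co_freq = Counter()    # n_ab: number of texts containing both a and b
        for words in texts:    # each text is a preprocessed list of words
            unique = set(words)
            doc_freq.update(unique)
            co_freq.update(frozenset(pair) for pair in combinations(sorted(unique), 2))
        weights = {}
        for pair, n_ab in co_freq.items():
            a, b = tuple(pair)
            weights[(a, b)] = alpha * n_ab / doc_freq[a] + beta * n_ab / doc_freq[b]
        return weights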
the text mapping is performed as follows:
acquiring the mutual information between a text and a semantic community to measure their degree of similarity; the smaller the mutual information, the lower the relevance between the text and the semantic community, and the lower the likelihood that the text is similar to the topic described by the semantic community; the larger the mutual information, the higher the relevance, and the higher the likelihood that the text is similar to the topic described by the semantic community; the mutual information is calculated as follows:
where p(d) is the probability that document d is selected; since texts are independent of one another during the mapping of texts to communities, p(d) is set to 1; p(c_i) is the probability that community c_i is selected; and p(d, c_i) is the joint probability of document d and event semantic community c_i, which is defined by the degree of association between document d and community c_i; this degree of association is calculated by a similarity formula between document d and community c_i;
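The mutual information formula itself does not survive in the text above. As a hedged reconstruction only, the standard pointwise form consistent with the definitions just given (p(d) set to 1 and the joint probability taken as the document-community association degree) would be:

    \[
      MI(d, c_i) = p(d, c_i)\,\log\frac{p(d, c_i)}{p(d)\,p(c_i)}, \qquad p(d) = 1, \quad p(d, c_i) \approx \mathrm{sim}(d, c_i)
    \]

where sim(d, c_i) denotes the association degree between document d and community c_i; the symbol sim is introduced here for illustration and is not taken from the original formula.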
the process of reconstructing the semantic community is as follows:
the edge weights of the semantic community are adjusted through a formula that combines the semantic chain weight between nodes a and b in the associated semantic chain network after the text mapping is finished, the semantic chain weight between nodes a and b within the community when the community division ends, and the semantic chain weight between nodes a and b in the original associated semantic chain network;
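The adjustment formula is likewise not reproduced above. One natural form, offered purely as an assumption for illustration, is a weighted combination of the three weights just listed:

    \[
      w_{ab}^{\mathrm{new}} = \lambda_1\, w_{ab}^{\mathrm{map}} + \lambda_2\, w_{ab}^{\mathrm{comm}} + (1 - \lambda_1 - \lambda_2)\, w_{ab}^{\mathrm{orig}}
    \]

where w_map, w_comm and w_orig denote the weight after text mapping, the weight at the end of community division and the weight in the original associated semantic chain network, and λ1, λ2 are mixing coefficients; all of these symbols are hypothetical.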
the text recognition rate is obtained by the following steps:
number of partitioned semantic communities
And text quantity->
The closer the ratio is to 1, the closer the result of semantic community division is to the actual situation;
on the other hand, the purity index is used for verifying the accuracy of the algorithm; the specific calculation formula is as follows;
wherein the method comprises the steps of
Is a weighted harmonic mean of accuracy and recall, get +.>
;
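For concreteness, the recognition-rate check and the purity check can be sketched in Python as below; the balanced F1 form of the harmonic mean is an assumption, since the exact weighting is not specified above.

    def recognition_ratio(num_communities, num_texts):
        # Ratio of divided semantic communities to texts; values near 1 indicate a division close to reality.
        return num_communities / num_texts

    def f_measure(precision, recall, beta=1.0):
        # Weighted harmonic mean of precision and recall; beta = 1 gives the balanced F1 score.
        if precision + recall == 0:
            return 0.0
        return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)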
step two, data retrieval is carried out by inputting keywords and searching the data index base, so as to form a keyword-related text set;
keywords are input and segmented into words; the text set matching the keywords is quickly retrieved from the data index base according to the keywords, ranked, and returned to a list display page, forming the keyword-related text set;
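The index structure and ranking criterion are not detailed above; the sketch below assumes an inverted index mapping each word to the set of text identifiers containing it, and ranks texts simply by the number of query words they match.

    def retrieve(query, inverted_index):
        # Hypothetical keyword retrieval: segment the query, look up each word in an
        # inverted index (word -> set of text ids), and rank texts by matched-word count.
        hits = {}
        for word in query.split():
            for text_id in inverted_index.get(word, set()):
                hits[text_id] = hits.get(text_id, 0) + 1
        return sorted(hits, key=hits.get, reverse=True)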
step three, generating multiple texts from the text set obtained by the keywords according to a network graph structure, and ranking the texts;
the multiple texts are generated by carrying out multi-dimensional data statistics on the text set (the multi-dimensional statistics cover aspects such as the text source website and media attention); the text set is clustered using a text similarity analysis method, semantics are discovered spontaneously from the text set, and, combined with a multi-document generation method, articles are generated according to the user's requirements;
the text sentence scores are obtained, and the texts are sorted according to the text sentence scores; the sorting flow is as follows:
as shown in fig. 2, an undirected graph is constructed through the LexRank algorithm, with sentences as nodes and the similarity between nodes as edges; each sentence node is scored for importance according to its degree and the weights of its edges, and the top-scoring sentences are finally selected as the text candidate sentences;
the text candidate sentence scores are calculated as follows:
the text candidate sentences are represented with a vector space model; because the text candidate sentences are generated from short texts, their word sequences can be obtained directly (without word segmentation, stop-word removal, or similar operations on the text sentences); with words as the smallest linguistic units in the VSM, the set of text sentences can be represented as:
S = {s_1, s_2, ..., s_n}, where s_k is one text sentence in the text sentence set S, whose vector representation is s_k = (w_k1, w_k2, ..., w_km), where w_ki is the TF-ISF value of the i-th term feature item in the k-th text sentence; TF-ISF is calculated as TF-ISF(w, s_k) = tf(w, s_k) · log(N / n), where tf(w, s_k) is the frequency of word w in the text sentence, N is the number of text sentences in the text sentence set, and n is the number of text sentences containing word w;
the similarity of each text sentence in the text sentence set is calculated through cosine similarity and is used as the weight among all nodes in the language network diagram, and the calculation mode is as follows:
the constructed undirected graph G is represented by an n × n matrix A, where n is the total number of sentences in the text sentence set and the diagonal elements A_ii are set to 0, i.e., the self-connections of sentence nodes are ignored;
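The vector construction and the similarity matrix described above can be sketched as follows; the TF-ISF form tf · log(N/n) matches the definition given earlier, while the function and variable names are illustrative.

    import math

    def tfisf_vectors(sentences):
        # sentences: list of sentences, each a list of words.
        # TF-ISF(w, s) = tf(w, s) * log(N / n), as defined above.
        N = len(sentences)
        vocab = sorted({w for s in sentences for w in s})
        sent_freq = {w: sum(1 for s in sentences if w in s) for w in vocab}
        return [[s.count(w) * math.log(N / sent_freq[w]) for w in vocab] for s in sentences]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def adjacency(vectors):
        # n x n similarity matrix A with zero diagonal (self-connections ignored).
        n = len(vectors)
        return [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]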
the semantic representations of the text candidate sentence and of the event are calculated, and the relevance of the text sentence to the event is calculated through a formula in which e is the event semantic community obtained by the community discovery algorithm;
the modified saliency of a sentence node is expressed by the following formula, in which S refers to the whole text sentence set and d is an adjustable parameter in the interval [0, 1] that adjusts, in the saliency calculation, the proportion between the similarity of a sentence node to the event topic and its similarity to the other sentence nodes; the larger the value of d, the more weight the relevance between the text sentence and the event semantics is given in the saliency score; with this definition, LexRank can be converted into a matrix form in which A and B are both square matrices, the A matrix represents the similarity between sentence nodes, the B matrix represents the similarity between sentence nodes and the event semantics, and the resulting score vector gives the final scores of the text sentence set;
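The matrix form of the modified LexRank is not reproduced above; the sketch below implements one plausible reading of it, often called biased LexRank, in which each sentence's score mixes its relevance to the event (weight d) with scores propagated from similar sentences (weight 1 − d). The combination rule and convergence details are assumptions.

    def biased_lexrank(A, rel, d=0.85, iters=100, tol=1e-6):
        # A: sentence-similarity matrix with zero diagonal; rel[i]: relevance of sentence i to the event.
        n = len(A)
        rel_sum = sum(rel) or 1.0
        deg = [sum(A[j]) or 1.0 for j in range(n)]   # degree of each node (A is symmetric)
        p = [1.0 / n] * n
        for _ in range(iters):
            new_p = [d * rel[i] / rel_sum
                     + (1 - d) * sum(A[i][j] / deg[j] * p[j] for j in range(n))
                     for i in range(n)]
            if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
                return new_p
            p = new_p
        return p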
redundancy processing:
two sets are defined: a result set, which is initially empty, and a candidate set, in which each element is a text sentence together with its score from the saliency calculation in the previous step;
the elements of the candidate set are sorted in descending order of score;
the text sentence with the highest score is moved from the candidate set to the result set, and the scores of all text sentences remaining in the candidate set are recalculated according to a formula;
this is repeated until the candidate set is empty or the text reaches a preset condition limit;
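The rescoring formula used in the candidate-set loop is not reproduced above; the sketch below uses an MMR-style penalty (subtracting the maximum similarity to already-selected sentences) as an assumed stand-in, and the parameter lam is hypothetical.

    def remove_redundancy(scores, sim, max_select=None, lam=0.7):
        # scores[i]: saliency score of sentence i; sim[i][j]: similarity between sentences i and j.
        candidates = dict(enumerate(scores))
        selected = []
        while candidates and (max_select is None or len(selected) < max_select):
            best = max(candidates, key=candidates.get)   # highest-scoring candidate
            selected.append(best)
            del candidates[best]
            for i in list(candidates):                   # rescore the remaining candidates
                penalty = max(sim[i][j] for j in selected)
                candidates[i] = lam * scores[i] - (1 - lam) * penalty
        return selected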
finally, sentence ordering is carried out on the candidate texts to form a candidate text sentence recommendation ranking sequence:
when two candidate text sentences both contain time information, they are arranged in chronological order;
when two candidate text sentences both come from the same source text, they are ranked according to their order of appearance in that source text;
when two candidate text sentences neither contain time information nor originate from the same source text, they are sorted by the relevance of the text sentences to the event semantics, with the more relevant candidate text sentence placed before the less relevant one;
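The three ordering rules can be sketched as a pairwise comparison; each candidate is assumed here to be a dictionary with optional 'time', 'source' and 'position' fields and a 'relevance' field, all of which are hypothetical names.

    from functools import cmp_to_key

    def compare_candidates(a, b):
        # Rule 1: both carry time information -> chronological order.
        if a.get("time") is not None and b.get("time") is not None:
            return (a["time"] > b["time"]) - (a["time"] < b["time"])
        # Rule 2: both come from the same source text -> order of appearance in that text.
        if a.get("source") is not None and a.get("source") == b.get("source"):
            return (a["position"] > b["position"]) - (a["position"] < b["position"])
        # Rule 3: otherwise, higher relevance to the event semantics first.
        return (b["relevance"] > a["relevance"]) - (b["relevance"] < a["relevance"])

    def rank_candidates(candidates):
        return sorted(candidates, key=cmp_to_key(compare_candidates))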
step four, the operator screens the candidate text sentences according to the sentence ordering of the candidate texts, so as to accurately locate the candidate text sentences;
the availability of the candidate text sentences is checked, and candidate text sentences whose references deviate are removed to form a new candidate text sentence recommendation ranking sequence; at the same time, the features of the removed candidate text sentences are recorded, the operator is asked for the reason for removal and this information is collected, and text sentences matching the removal reason are avoided when keywords are subsequently input for data retrieval;
the text searching method and the text searching device can search the content of the related text and acquire the score of the text sentence according to the input keywords, and then sort the text sentence according to the similarity of the keywords, so that the semantics can be found spontaneously from the text set, the generation of multiple documents is performed, articles are generated according to the requirements of users, the users can quickly check the data, the searching time is saved, and the data acquisition efficiency is improved.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.