CN116414939A - Article generation method based on multidimensional data - Google Patents

Article generation method based on multidimensional data

Info

Publication number
CN116414939A
CN116414939A
Authority
CN
China
Prior art keywords
text
sentence
candidate
semantic
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310661303.7A
Other languages
Chinese (zh)
Other versions
CN116414939B (en)
Inventor
陈毅凯
杨石
张凌哲
胡小武
江朋欣
陈杰杰
江永胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Guozhun Data Co ltd
Original Assignee
Nanjing Guozhun Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Guozhun Data Co ltd
Priority to CN202310661303.7A
Publication of CN116414939A
Application granted
Publication of CN116414939B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

Landscapes

Abstract

The invention discloses an article generation method based on multidimensional data, relating to the technical field of article generation. The method comprises the following steps: step one, preprocessing text data and establishing a data index base, where the index base is built by constructing an associated semantic chain network, mapping texts, reconstructing semantic communities, and acquiring text recognition rates; step two, performing data retrieval by inputting keywords and searching the data index base, so as to form a keyword-related text set; step three, generating multiple texts from the keyword-derived text set according to a network graph structure, and ranking the texts; step four, having the reviewer screen the candidate text sentences according to the sentence ordering of the candidate texts. The method can discover semantics spontaneously from the text set, generate multiple documents, and produce articles according to user requirements, so that users can consult the data quickly, retrieval time is saved, and data acquisition efficiency is improved.

Description

Article generation method based on multidimensional data
Technical Field
The invention relates to the technical field of article generation, in particular to an article generation method based on multi-dimensional data.
Background
When using a search engine, a user inputs keywords that best summarize the information the user wants to find; text data is then identified and matched against the keywords, and texts are recommended.
The prior art has the following defects: texts retrieved by keyword cannot be scored for similarity and importance; it is difficult to discover semantics spontaneously from a text set and combine them for data recommendation; the user's retrieval time cannot be substantially reduced; and the efficiency of data retrieval is difficult to improve.
Disclosure of Invention
The invention aims to provide an article generating method based on multidimensional data, which aims to solve the defects in the background technology.
In order to achieve the above object, the present invention provides the following technical solutions: the article generating method based on the multi-dimensional data comprises the following steps:
step one, preprocessing text data and establishing a data index base, wherein the establishing mode of the data index base comprises the steps of establishing a related semantic chain network, mapping texts, reconstructing semantic communities and acquiring text recognition rates;
step two, data retrieval is carried out by inputting keywords, and data searching is carried out in a data index base, so that a keyword correlation text set is formed;
thirdly, generating multiple texts according to a network diagram structure from a text set obtained by the keywords, and sequencing the texts;
and step four, the reviewer screens the candidate text sentences according to the sentence ordering of the candidate texts, so as to accurately locate the candidate text sentences.
In a preferred embodiment, the construction flow for constructing the associated semantic chain network is as follows:
selecting a word set of a text preprocessing part as a semantic node set, and taking a co-occurrence relation based on words as an association rule;
carrying out community division on semantic communities in the related semantic chain network through the weight of the semantic chain, so as to form single semantic communities each describing the same event or theme; the weight of the semantic chain between words a and b is computed, by a formula given only as an image in the original, from: the number of texts in which words a and b occur together; the number of texts in which word a occurs; the number of texts in which word b occurs; and the weight coefficients of words a and b.
In a preferred embodiment, the text mapping association is performed as follows:
acquiring the mutual information of the text and the semantic community to measure their degree of similarity; the smaller the mutual information, the lower the relevance between the text and the semantic community, and the lower the likelihood that the text matches the topic the semantic community describes; the higher the mutual information, the higher the relevance and the likelihood that the text matches that topic. The mutual information of document d and event semantic community c is computed, by a formula given only as an image in the original, from: p(d), the probability that document d is selected, which is set to 1 because the texts are independent during the text-community mapping; p(c), the probability that community c is selected; and the association degree (similarity) of document d and community c, which is itself calculated by a further formula (also an image in the original).
in a preferred embodiment, the process of reconstructing the semantic community is:
the edge weights of the semantic community are adjusted through a formula (given only as an image in the original) combining: the semantic chain weight between nodes a and b after the text mapping is finished; the semantic chain weight between nodes a and b within the community when the community division ends; and the semantic chain weight between nodes a and b in the original associated semantic chain network.
The text recognition rate is obtained as follows: the closer the ratio of the number of partitioned semantic communities to the number of texts is to 1, the closer the result of the semantic community division is to the actual situation.
in a preferred embodiment, the keyword relevance text set is generated in the following manner:
inputting keywords, performing word segmentation on the input keywords, quickly searching a text set matching the keywords in the data index library, ranking the text set, and returning it to a list display page.
In a preferred embodiment, the ranking procedure of the candidate text sentences is as follows:
the text sentence score is obtained, and text sorting is carried out according to the text sentence score, wherein the text sorting flow is as follows:
an undirected graph is constructed through the LexRank algorithm, with sentences as nodes and the similarity between nodes as edges; each sentence node is then given an importance score according to its degree and the weights of its edges, and the highest-scoring sentences are selected as the final text candidate sentences;
the text candidate sentence score is calculated in the following calculation mode:
the text candidate sentences are represented with a vector space model; because they are generated from short texts, their word sequences can be obtained directly. With words as the smallest linguistic units in the VSM, the text sentence set is represented as S = {s1, s2, ..., sn}, where each text sentence sk in S is expressed as a vector whose i-th component is the TF-ISF value of the i-th term feature item in the k-th text sentence. TF-ISF is computed (the formula is an image in the original; its standard form is tf-isf(w) = tf_w × log(N/n)) from: tf_w, the frequency of word w in the text sentence; N, the number of text sentences in the set; and n, the number of text sentences containing word w.
The similarity of each pair of text sentences in the set is calculated by cosine similarity and used as the weight between the corresponding nodes in the language network graph.
The constructed undirected graph G is represented by an n × n matrix A, where n is the total number of sentences in the text sentence set and the diagonal entries are set to 0, i.e. the self-connections of the sentence nodes are ignored.
The semantic representations of the text candidate sentences and the event are then calculated, and the relevance of each text sentence to the event is computed by a formula (an image in the original), where e is an event semantic community divided by the community discovery algorithm.
The modified saliency of a sentence node is expressed by a further formula (an image in the original), where S refers to the whole text sentence set and d is an adjustable parameter in the interval [0, 1]; the larger the value of d, the more the relevance between the text sentence and the event semantics counts in the saliency score. By definition, LexRank can then be converted into a matrix form in which A and B are both square matrices: matrix A represents the similarity between sentence nodes, matrix B the similarity between sentence nodes and the event semantics, and the resulting vector is the final score of the text sentence set.
In a preferred embodiment, the redundancy processing manner of the candidate text sentence is as follows:
two sets are defined (their names in the original are images): a selected set, initially empty, and a candidate set whose elements are the text sentences, each carrying the score it received in the preceding saliency calculation;
the elements of the candidate set are sorted in descending order of score;
the highest-scoring text sentence is moved from the candidate set to the selected set, and the scores of all text sentences remaining in the candidate set are then recalculated by a formula (an image in the original);
this is repeated until the candidate set is empty or the text reaches a preset limit.
finally, sentence ordering is carried out on the candidate texts to form a candidate text sentence recommendation ranking sequence:
when the two candidate texts contain time information, they are arranged in chronological order;
when two candidate text sentences both come from the same data text, they are ranked in the order in which they appear in that data text;
when the two candidate texts contain no time information and do not originate from the same data text, they are sorted by the relevance of the text sentence to the event semantics, candidate text sentences with high relevance being placed before those with low relevance.
In a preferred embodiment, the method for accurately locating the candidate text sentence is as follows:
the reviewer checks the availability of the candidate text sentences and removes those whose references are off-topic, forming a new candidate text sentence recommendation ranking; the features of the removed candidate sentences are recorded, the operator is asked for the reason of removal, and this information is collected so that text sentences matching those removal reasons are avoided when keywords are subsequently input for data retrieval.
In the technical scheme, the invention has the technical effects and advantages that:
1. according to the input keywords, the method retrieves the content of related texts and scores their text sentences, then ranks the text sentences by keyword similarity; semantics can thus be discovered spontaneously from the text set, multiple documents generated, and articles produced according to user requirements, so that users can consult the data quickly, retrieval time is saved, and data acquisition efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to these drawings for a person having ordinary skill in the art.
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is the constructed undirected graph of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In embodiment 1, referring to fig. 1, the article generating method based on multi-dimensional data in this embodiment includes the following steps:
step one, preprocessing text data and establishing a data index base, wherein the establishing mode of the data index base comprises the steps of establishing a related semantic chain network, mapping texts, reconstructing semantic communities and acquiring text recognition rates;
the construction flow for constructing the associated semantic chain network comprises the following steps:
selecting a word set of a text preprocessing part as a semantic node set, and taking a co-occurrence relation of words as an association rule (namely, two semantic nodes which occur simultaneously in the same text are regarded as being associated, and measuring the association degree of the semantic nodes through the weight of a semantic chain);
carrying out community division on semantic communities in the related semantic chain network through the weight of the semantic chain, so as to form single semantic communities each describing the same event or theme; the weight of the semantic chain between words a and b is computed, by a formula given only as an image in the original, from: the number of texts in which words a and b occur together; the number of texts in which word a occurs; the number of texts in which word b occurs; and the weight coefficients of words a and b;
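The co-occurrence statistics behind the semantic chain weight can be sketched as follows. Since the patent gives the weight formula only as an image, the normalization `f_ab / (alpha * f_a + beta * f_b)` and the parameter names `alpha`/`beta` are assumptions standing in for the per-word weight coefficients the text mentions:

```python
from collections import Counter
from itertools import combinations

def semantic_chain_weights(texts, alpha=0.5, beta=0.5):
    """Sketch: semantic chain weight from co-occurrence counts.

    f_ab = number of texts where words a and b co-occur;
    f_a, f_b = number of texts containing a, b respectively.
    The normalization used here is an assumption, not the patent's formula.
    """
    doc_freq = Counter()   # f_a: texts containing word a
    co_freq = Counter()    # f_ab: texts containing both a and b
    for words in texts:
        vocab = set(words)
        doc_freq.update(vocab)
        co_freq.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    weights = {}
    for pair, f_ab in co_freq.items():
        a, b = sorted(pair)
        weights[(a, b)] = f_ab / (alpha * doc_freq[a] + beta * doc_freq[b])
    return weights

texts = [["storm", "flood", "rescue"], ["storm", "flood"], ["storm", "match"]]
w = semantic_chain_weights(texts)
```

Word pairs that co-occur often relative to their individual frequencies get heavier chains, which is what the subsequent community division relies on.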
the text mapping association is performed as follows:
acquiring the mutual information of the text and the semantic community to measure their degree of similarity; the smaller the mutual information, the lower the relevance between the text and the semantic community, and the lower the likelihood that the text matches the topic the semantic community describes; the higher the mutual information, the higher the relevance and the likelihood that the text matches that topic. The mutual information of document d and event semantic community c is computed, by a formula given only as an image in the original, from: p(d), the probability that document d is selected, which is set to 1 because the texts are independent during the text-community mapping; p(c), the probability that community c is selected; and the association degree (similarity) of document d and community c, which is itself calculated by a further formula (also an image in the original);
the process of reconstructing the semantic community is as follows:
the edge weights of the semantic community are adjusted through a formula (given only as an image in the original) combining: the semantic chain weight between nodes a and b in the associated semantic chain network after the text mapping is finished; the semantic chain weight between nodes a and b within the community when the community division ends; and the semantic chain weight between nodes a and b in the original associated semantic chain network;
the text recognition rate is obtained as follows: the closer the ratio of the number of partitioned semantic communities to the number of texts is to 1, the closer the result of the semantic community division is to the actual situation;
on the other hand, a purity index is used to verify the accuracy of the algorithm; it is a weighted harmonic mean of precision and recall (the specific formula is an image in the original).
Step two, data retrieval is carried out by inputting keywords, and data searching is carried out in a data index base, so that a keyword correlation text set is formed;
inputting keywords, performing word segmentation on the input keywords, quickly searching a text set matching the keywords in the data index library, ranking the text set, and returning it to a list display page to form a keyword-related text set;
thirdly, generating multiple texts according to a network diagram structure from a text set obtained by the keywords, and sequencing the texts;
the multi-text generation is performed by carrying out multi-dimensional data statistics on the text set (the multi-dimensional data statistics cover the texts' source websites, media attention, and other aspects); the text set is clustered by a text similarity analysis method, semantics are discovered spontaneously from the text set and, combined with a multi-document generation method, articles are generated according to user requirements;
the text sentence score is obtained, and text sorting is carried out according to the text sentence score, wherein the text sorting flow is as follows:
as shown in fig. 2, an undirected graph is constructed through the LexRank algorithm, with sentences as nodes and the similarity between nodes as edges; each sentence node is then given an importance score according to its degree and the weights of its edges, and the highest-scoring sentences are selected as the final text candidate sentences;
the text candidate sentence score is calculated in the following calculation mode:
the text candidate sentences are represented with a vector space model; because they are generated from short texts, their word sequences can be obtained directly (without word segmentation, stop-word removal, or similar operations on the text sentences). With words as the smallest linguistic units in the VSM, the text sentence set is represented as S = {s1, s2, ..., sn}, where each text sentence sk in S is expressed as a vector whose i-th component is the TF-ISF value of the i-th term feature item in the k-th text sentence. TF-ISF is computed (the formula is an image in the original; its standard form is tf-isf(w) = tf_w × log(N/n)) from: tf_w, the frequency of word w in the text sentence; N, the number of text sentences in the set; and n, the number of text sentences containing word w.
The similarity of each pair of text sentences in the set is calculated by cosine similarity and used as the weight between the corresponding nodes in the language network graph.
The constructed undirected graph G is represented by an n × n matrix A, where n is the total number of sentences in the text sentence set and the diagonal entries are set to 0, i.e. the self-connections of the sentence nodes are ignored.
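The sentence vectors and the adjacency matrix can be sketched as below, using the standard TF-ISF form tf × log(N/n) and ordinary cosine similarity (the patent's formulas are images, so this is a reading of the surrounding definitions, not a reproduction):

```python
import math

def tf_isf_vectors(sentences):
    """TF-ISF sentence vectors (tf * log(N/n)) as word -> weight dicts."""
    N = len(sentences)
    n_w = {}  # number of sentences containing word w
    for s in sentences:
        for w in set(s):
            n_w[w] = n_w.get(w, 0) + 1
    vecs = []
    for s in sentences:
        tf = {}
        for w in s:
            tf[w] = tf.get(w, 0) + 1
        vecs.append({w: tf_w * math.log(N / n_w[w]) for w, tf_w in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(sentences):
    """n x n adjacency matrix A with the diagonal zeroed (no self-connections)."""
    vecs = tf_isf_vectors(sentences)
    n = len(vecs)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i][j] = cosine(vecs[i], vecs[j])
    return A
```

Sentences sharing distinctive words get a positive edge; sentences with no common words get weight 0, and the diagonal stays 0 as the text requires.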
the semantic representations of the text candidate sentences and the event are then calculated, and the relevance of each text sentence to the event is computed by a formula (an image in the original), where e is an event semantic community divided by the community discovery algorithm;
the modified saliency of a sentence node is expressed by a further formula (an image in the original), where S refers to the whole text sentence set and d is an adjustable parameter in the interval [0, 1] that adjusts, in the saliency calculation, the proportion between the sentence-to-event-topic similarity and the sentence-to-sentence similarity; the larger the value of d, the more the relevance between the text sentence and the event semantics counts in the saliency score. By definition, LexRank can then be converted into a matrix form in which A and B are both square matrices: matrix A represents the similarity between sentence nodes, matrix B the similarity between sentence nodes and the event semantics, and the resulting vector is the final score of the text sentence set;
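The event-biased saliency score can be sketched as a power iteration. The combination p = d·b + (1−d)·Aᵀp is a commonly used form of biased LexRank and is an assumption here, since the patent's matrix equation is only an image; `b` is the B matrix collapsed to a sentence-to-event similarity vector:

```python
def biased_lexrank(A, b, d=0.85, iters=100, tol=1e-8):
    """Power-iteration sketch of event-biased LexRank.

    A[i][j]: similarity between sentence nodes i and j;
    b[i]: similarity between sentence i and the event semantics.
    Larger d weighs sentence-event relevance more, matching the text.
    """
    n = len(A)
    # Row-normalize A so each row sums to 1 (degree-weighted transitions).
    A_norm = []
    for row in A:
        s = sum(row)
        A_norm.append([x / s for x in row] if s else [1.0 / n] * n)
    sb = sum(b)
    b_norm = [x / sb for x in b] if sb else [1.0 / n] * n
    p = [1.0 / n] * n
    for _ in range(iters):
        new = [d * b_norm[i]
               + (1 - d) * sum(A_norm[j][i] * p[j] for j in range(n))
               for i in range(n)]
        if max(abs(new[i] - p[i]) for i in range(n)) < tol:
            p = new
            break
        p = new
    return p
```

The scores form a probability distribution over sentences, and a sentence strongly tied to the event semantics rises in the ranking even when its graph neighborhood is sparse.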
redundancy processing:
two sets are defined (their names in the original are images): a selected set, initially empty, and a candidate set whose elements are the text sentences, each carrying the score it received in the preceding saliency calculation;
the elements of the candidate set are sorted in descending order of score;
the highest-scoring text sentence is moved from the candidate set to the selected set, and the scores of all text sentences remaining in the candidate set are then recalculated by a formula (an image in the original);
this is repeated until the candidate set is empty or the text reaches a preset limit;
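The greedy loop above can be sketched as follows. Since the re-scoring formula is only an image in the patent, the MMR-style penalty (reducing each remaining score in proportion to its similarity to the sentence just selected, with an assumed `penalty` coefficient) is a stand-in:

```python
def remove_redundancy(sentences, scores, sim, penalty=0.7, limit=None):
    """Greedy redundancy filter.

    Repeatedly moves the top-scoring sentence from the candidate set to the
    selected set, then down-weights the remaining candidates by their
    similarity to it. sim(i, j) returns similarity between sentences i, j.
    The penalty form is an assumption, not the patent's formula.
    """
    remaining = dict(enumerate(scores))  # candidate set: index -> score
    selected = []                        # selected set, initially empty
    while remaining and (limit is None or len(selected) < limit):
        best = max(remaining, key=remaining.get)
        selected.append(sentences[best])
        del remaining[best]
        for i in remaining:
            remaining[i] -= penalty * sim(best, i) * remaining[i]
    return selected
```

A sentence nearly identical to an already-selected one is pushed down the ranking, so diverse sentences surface earlier.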
finally, sentence ordering is carried out on the candidate texts to form a candidate text sentence recommendation ranking sequence:
when the two candidate texts contain time information, they are arranged in chronological order;
when two candidate text sentences both come from the same data text, they are ranked in the order in which they appear in that data text;
when the two candidate texts contain no time information and do not originate from the same data text, they are sorted by the relevance of the text sentence to the event semantics, candidate text sentences with high relevance being placed before those with low relevance;
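The three ordering rules can be expressed as a comparator. The field names (`time`, `source`, `pos`, `relevance`) are illustrative, not from the patent:

```python
from functools import cmp_to_key

def order_candidates(cands):
    """Order candidates by: shared time info -> chronological;
    same source text -> position in that text; otherwise -> relevance
    to the event semantics, high before low. Field names are assumptions."""
    def cmp(x, y):
        if x.get("time") is not None and y.get("time") is not None:
            return -1 if x["time"] < y["time"] else (1 if x["time"] > y["time"] else 0)
        if x.get("source") is not None and x.get("source") == y.get("source"):
            return -1 if x["pos"] < y["pos"] else (1 if x["pos"] > y["pos"] else 0)
        return -1 if x["relevance"] > y["relevance"] else (1 if x["relevance"] < y["relevance"] else 0)
    return sorted(cands, key=cmp_to_key(cmp))
```

Each rule only fires when its precondition holds for both candidates, mirroring the three cases in the text.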
step four, the reviewer screens the candidate text sentences according to the sentence ordering of the candidate texts, so as to accurately locate the candidate text sentences;
the reviewer checks the availability of the candidate text sentences and removes those whose references are off-topic, forming a new candidate text sentence recommendation ranking; the features of the removed candidate sentences are recorded, the operator is asked for the reason of removal, and this information is collected so that text sentences matching those removal reasons are avoided when keywords are subsequently input for data retrieval;
according to the input keywords, the method retrieves the content of related texts and scores their text sentences, then ranks the text sentences by keyword similarity; semantics can thus be discovered spontaneously from the text set, multiple documents generated, and articles produced according to user requirements, so that users can consult the data quickly, retrieval time is saved, and data acquisition efficiency is improved.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. The article generating method based on the multi-dimensional data is characterized by comprising the following steps of:
step one, preprocessing text data and establishing a data index base, wherein the establishing mode of the data index base comprises the steps of establishing a related semantic chain network, mapping texts, reconstructing semantic communities and acquiring text recognition rates;
step two, data retrieval is carried out by inputting keywords, and data searching is carried out in a data index base, so that a keyword correlation text set is formed;
thirdly, generating multiple texts according to a network diagram structure from a text set obtained by the keywords, and sequencing the texts;
and step four, the reviewer screens the candidate text sentences according to the sentence ordering of the candidate texts, so as to accurately locate the candidate text sentences.
2. The article generation method based on multi-dimensional data according to claim 1, wherein: the construction flow for constructing the associated semantic chain network comprises the following steps:
selecting a word set of a text preprocessing part as a semantic node set, and taking a co-occurrence relation based on words as an association rule;
carrying out community division on semantic communities in the related semantic chain network through the weight of the semantic chain, so as to form single semantic communities each describing the same event or theme; the weight of the semantic chain between words a and b is computed, by a formula given only as an image in the original, from: the number of texts in which words a and b occur together; the number of texts in which word a occurs; the number of texts in which word b occurs; and the weight coefficients of words a and b.
3. The article generation method based on multi-dimensional data according to claim 2, wherein: the text mapping association mode is as follows:
acquiring the mutual information of the text and the semantic community to measure their degree of similarity; the smaller the mutual information, the lower the relevance between the text and the semantic community, and the lower the likelihood that the text matches the topic the semantic community describes; the higher the mutual information, the higher the relevance and the likelihood that the text matches that topic. The mutual information of document d and event semantic community c is computed, by a formula given only as an image in the original, from: p(d), the probability that document d is selected, which is set to 1 because the texts are independent during the text-community mapping; p(c), the probability that community c is selected; and the association degree (similarity) of document d and community c, which is itself calculated by a further formula (also an image in the original).
4. the article generation method based on multi-dimensional data according to claim 3, wherein: the process of reconstructing the semantic community is as follows:
the edge weights of the semantic community are adjusted through a formula (given only as an image in the original) combining: the semantic chain weight between nodes a and b after the text mapping is finished; the semantic chain weight between nodes a and b within the community when the community division ends; and the semantic chain weight between nodes a and b in the original associated semantic chain network;
the text recognition rate is obtained as follows: the closer the ratio of the number of partitioned semantic communities to the number of texts is to 1, the closer the result of the semantic community division is to the actual situation.
5. the article generation method based on multidimensional data according to claim 4, wherein: the generation mode of the keyword relevance text set is as follows:
inputting keywords, performing word segmentation on the input keywords, quickly searching a text set matching the keywords in the data index library, ranking the text set, and returning it to a list display page.
6. The article generation method based on multidimensional data according to claim 5, wherein: the ranking flow of the candidate text sentences is as follows:
the text sentence score is obtained, and text sorting is carried out according to the text sentence score, wherein the text sorting flow is as follows:
an undirected graph is constructed through the LexRank algorithm, with sentences as nodes and the similarity between nodes as edges; each sentence node is then given an importance score according to its degree and the weights of its edges, and the highest-scoring sentences are selected as the final text candidate sentences;
the text candidate sentence score is calculated in the following calculation mode:
the text candidate sentences are represented by a vector space model; because the text candidate sentences are generated from short texts, their word sequences can be obtained directly; with words as the smallest linguistic units in the VSM, the set of text sentences is represented as (Figure QLYQS_25), where (Figure QLYQS_26) is one text sentence in the text sentence set S, whose vector representation is (Figure QLYQS_27), in which (Figure QLYQS_28) is the TF-ISF value of the i-th term feature item in the k-th text sentence; the TF-ISF value is calculated as TF-ISF(w) = TF(w) × log(N / n) (Figure QLYQS_29), where TF(w) (Figure QLYQS_30) is the word frequency of the word w in the text sentence, N is the number of text sentences in the text sentence set, and n is the number of text sentences containing the word w;
the similarity of each pair of text sentences in the text sentence set is calculated through cosine similarity, sim(s_i, s_j) = (s_i · s_j) / (|s_i| |s_j|) (Figure QLYQS_31), and used as the weight between the corresponding nodes in the language network diagram;
the constructed undirected graph G is represented by an n × n matrix A (Figure QLYQS_33), where n is the total number of sentences in the text sentence set and the diagonal entries (Figure QLYQS_32) are set to 0, i.e., the self-connections of the sentence nodes are ignored;
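The vectorisation and similarity steps of claim 6 can be sketched as follows; the tokenised input format, the helper name `tf_isf_matrix`, and the natural-logarithm base in TF-ISF are assumptions not fixed by the claim:

```python
import math
from collections import Counter

def tf_isf_matrix(sentences):
    """Build TF-ISF sentence vectors and the cosine-similarity
    adjacency matrix A with self-connections set to 0.

    `sentences` is a list of token lists, one per text sentence."""
    N = len(sentences)                       # number of text sentences
    vocab = sorted({w for s in sentences for w in s})
    # n(w): how many sentences contain the word w
    n = Counter(w for s in sentences for w in set(s))
    vectors = []
    for s in sentences:
        tf = Counter(s)                      # word frequency in this sentence
        vectors.append([tf[w] * math.log(N / n[w]) for w in vocab])

    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    # self-connections of sentence nodes are ignored (diagonal = 0)
    A = [[0.0 if i == j else cos(vectors[i], vectors[j])
          for j in range(N)] for i in range(N)]
    return vectors, A
```

Words occurring in every sentence get ISF = log(1) = 0, which is the intended effect: terms shared by all sentences carry no discriminating weight.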
the semantic representations of the text candidate sentences and of the event are calculated, and the relevance rel(s|e) of a text sentence s to the event is calculated through a formula (Figure QLYQS_34), where e is an event semantic community divided by the community discovery algorithm;
the modified saliency of a sentence node is expressed by the following calculation formula (Figure QLYQS_35):
p(s) = d · rel(s|e) / Σ_{z∈S} rel(z|e) + (1 − d) · Σ_{v∈S} [ sim(s, v) / Σ_{z∈S} sim(z, v) ] · p(v)
where S refers to the whole text sentence set and d is an adjustable parameter in the interval [0, 1]; the larger the value of d, the more weight the relevance between the text sentence and the event semantics receives in the saliency score; by definition, LexRank can be converted into the following matrix form (Figure QLYQS_36):
p = [ d · B + (1 − d) · A ]^T p
where A and B are square matrices: the A matrix represents the similarity between sentence nodes, and the B matrix represents the similarity between the sentence nodes and the event semantics; p (Figure QLYQS_37) is the vector of final scores for the text sentence set.
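A minimal power-iteration sketch of solving the event-biased LexRank fixed point p = [dB + (1 − d)A]^T p described above; the function name `biased_lexrank`, the row normalisation of A, and the stopping tolerance are assumptions, not part of the claim:

```python
def biased_lexrank(A, rel, d=0.15, tol=1e-8, max_iter=200):
    """Score sentences by power iteration on the event-biased
    LexRank transition matrix d*B + (1-d)*A_hat, where every row
    of B is the normalised event-relevance vector and A_hat is the
    row-normalised sentence-similarity matrix."""
    n = len(A)
    rel_sum = sum(rel) or 1.0
    b = [r / rel_sum for r in rel]           # normalised event relevance
    A_hat = []                               # row-normalise A
    for row in A:
        s = sum(row) or 1.0
        A_hat.append([x / s for x in row])
    p = [1.0 / n] * n                        # uniform start, sums to 1
    for _ in range(max_iter):
        # since sum(p) == 1, the B-term contributes exactly d * b[j]
        new = [d * b[j] + (1 - d) * sum(A_hat[i][j] * p[i] for i in range(n))
               for j in range(n)]
        if max(abs(new[j] - p[j]) for j in range(n)) < tol:
            return new
        p = new
    return p
```

Because the transition matrix is row-stochastic, the score vector keeps summing to 1 at every iteration, so the returned values are directly comparable saliency scores.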
7. The article generation method based on multidimensional data according to claim 6, wherein: the redundant processing mode of the candidate text sentence is as follows:
two sets (Figure QLYQS_38) and (Figure QLYQS_39) are defined, where (Figure QLYQS_40) is initially the empty set and each element of (Figure QLYQS_41) corresponds to the score of a text sentence from the saliency score calculation of the previous step;
the elements of (Figure QLYQS_42) are sorted in descending order;
suppose that at this time the element (Figure QLYQS_44) of (Figure QLYQS_43) is the text sentence with the highest score; (Figure QLYQS_45) is then moved from (Figure QLYQS_46) to (Figure QLYQS_47);
the scores of all text sentences remaining in (Figure QLYQS_48) are recalculated according to the formula (Figure QLYQS_49);
the above steps are repeated until (Figure QLYQS_50) is the empty set or the text reaches a preset condition limit;
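The redundancy-removal loop can be sketched as below; because the re-scoring formula appears only as an image in the original, the penalty `score − omega · sim` used here is an assumed MMR-style substitute, and the names `select_non_redundant` and `omega` are illustrative:

```python
def select_non_redundant(scores, sim, omega=1.0, limit=None):
    """Greedily move sentences from the candidate pool to the
    selected set, penalising candidates similar to each pick.

    scores[i]  : saliency score of sentence i
    sim[i][j]  : cosine similarity between sentences i and j
    omega      : strength of the assumed similarity penalty
    limit      : optional cap on the number of selected sentences"""
    pool = {i: s for i, s in enumerate(scores)}   # the non-empty set
    selected = []                                  # initially empty set
    while pool and (limit is None or len(selected) < limit):
        best = max(pool, key=pool.get)             # highest-scored candidate
        del pool[best]
        selected.append(best)
        for i in pool:                             # re-score the remainder
            pool[i] -= omega * sim[i][best]
    return selected
```

With a high penalty, a near-duplicate of an already-picked sentence drops below less similar but lower-scored candidates, which is the redundancy behaviour the claim describes.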
finally, sentence ordering is carried out on the candidate texts to form a candidate text sentence recommendation ranking sequence:
when two candidate text sentences both contain time information, they are arranged in chronological order;
when two candidate text sentences (Figure QLYQS_51) both originate from the same data text, then (Figure QLYQS_52) and (Figure QLYQS_53) are ranked according to their order of appearance in that data text;
when two candidate text sentences neither contain time information nor originate from the same data text, they are sorted by the relevance of the text sentence to the event semantics, with the candidate text sentence of higher relevance placed before that of lower relevance.
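The three ordering rules can be sketched as a pairwise comparator; the candidate fields `time`, `doc`, `pos`, and `rel` are assumed names, and a pairwise comparator mixing rules like this is not guaranteed to yield a total order, so this is only a sketch:

```python
from functools import cmp_to_key

def order_candidates(cands):
    """Arrange candidate sentences by the three claimed rules:
    1) both timestamped -> chronological order;
    2) same source text -> order of appearance (`pos`);
    3) otherwise        -> higher event relevance first.

    Each candidate is a dict with keys `time` (timestamp or None),
    `doc` (source-text id), `pos` (position in that text), `rel`."""
    def cmp(a, b):
        if a["time"] is not None and b["time"] is not None:
            return -1 if a["time"] < b["time"] else (1 if a["time"] > b["time"] else 0)
        if a["doc"] == b["doc"]:
            return a["pos"] - b["pos"]
        return -1 if a["rel"] > b["rel"] else (1 if a["rel"] < b["rel"] else 0)
    return sorted(cands, key=cmp_to_key(cmp))
```

In practice one would normalise the rules into a single sort key to guarantee transitivity; the comparator form is kept here because it mirrors the claim's pairwise phrasing.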
8. The article generation method based on multidimensional data according to claim 7, wherein: the accurate positioning method of the candidate text sentence comprises the following steps:
the availability of the candidate text sentences is checked, and candidate text sentences whose references are offset are removed to form a new candidate text sentence recommendation ranking sequence; meanwhile, the features of the removed candidate text sentences are recorded, the operator is asked for the reason for removal, and this information is collected so that text sentences matching the removal reason can be avoided in subsequent keyword-based data retrieval.
CN202310661303.7A | Priority 2023-06-06 | Filed 2023-06-06 | Article generation method based on multidimensional data | Active | CN116414939B (en)

Priority Applications (1)

Application Number: CN202310661303.7A (CN116414939B) | Priority Date: 2023-06-06 | Filing Date: 2023-06-06 | Title: Article generation method based on multidimensional data

Publications (2)

Publication Number | Publication Date
CN116414939A | 2023-07-11
CN116414939B | 2023-09-26

Family

ID=87059649

Family Applications (1)

Application Number: CN202310661303.7A | Status: Active | Priority Date: 2023-06-06 | Filing Date: 2023-06-06 | Publication: CN116414939B (en)

Country Status (1)

Country | Link
CN | CN116414939B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
US20090300003A1 (en)* | 2008-05-30 | 2009-12-03 | Kabushiki Kaisha Toshiba | Apparatus and method for supporting keyword input
CN103136352A (en)* | 2013-02-27 | 2013-06-05 | Central China Normal University | Full-text retrieval system based on two-level semantic analysis
CN110781679A (en)* | 2019-10-15 | 2020-02-11 | Shanghai University | News event keyword mining method based on associated semantic chain network


Also Published As

Publication Number | Publication Date
CN116414939B (en) | 2023-09-26

Similar Documents

Publication | Title
CN109829104B (en) | Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN109101479B (en) | Clustering method and device for Chinese sentences
US9613024B1 (en) | System and methods for creating datasets representing words and objects
CN103136352B (en) | Text retrieval system based on double-deck semantic analysis
JP4908214B2 (en) | Systems and methods for providing search query refinement.
US8073877B2 (en) | Scalable semi-structured named entity detection
CN104615593B (en) | Hot microblog topic automatic testing method and device
US7409404B2 (en) | Creating taxonomies and training data for document categorization
JP3636941B2 (en) | Information retrieval method and information retrieval apparatus
CN111104794A (en) | Text similarity matching method based on subject words
US20040249808A1 (en) | Query expansion using query logs
Varma et al. | IIIT Hyderabad at TAC 2009.
WO2011070832A1 (en) | Method of searching for document data files based on keywords, and computer system and computer program thereof
CN104199965A (en) | Semantic information retrieval method
WO2012159558A1 (en) | Natural language processing method, device and system based on semantic recognition
CN110866102A (en) | Search processing method
CN113761125B (en) | Dynamic summary determination method and device, computing device and computer storage medium
CN119988588A (en) | A large model-based multimodal document retrieval enhancement generation method
CN115794995A (en) | Target answer obtaining method and related device, electronic equipment and storage medium
KR20070007001A (en) | Retrieval Method and Device Using Automatic Query Extraction
CN109284389A (en) | A kind of information processing method of text data, device
CN118797005A (en) | Intelligent question-answering method, device, electronic device, storage medium and product
Noaman et al. | Naive Bayes classifier based Arabic document categorization
CN116401344A (en) | Method and device for searching table according to question
JP4426041B2 (en) | Information retrieval method by category factor

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
