Disclosure of Invention
The invention mainly aims to provide a method and a device for setting an article truncation point, and computer equipment, so as to solve the problem that the prior art mainly calculates the similarity of the information contained in two adjacent sentences while ignoring the information of all other sentences.
The invention provides a method for setting an article truncation point, which comprises the following steps:
inputting each sentence in an article into a BERT model to obtain a plurality of word vectors corresponding to each sentence, and inputting the word vectors, in the form of a word vector sequence, into a bidirectional long-short term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, wherein the first sentence vector is formed by splicing the word vector sequence in order, and the second sentence vector is formed by splicing the word vector sequence in reverse order;
splicing the tail end of the first sentence vector of each sentence with the head end of the second sentence vector to obtain a target vector of each sentence;
selecting a target sentence from the article, weighting and calculating a target vector corresponding to each sentence from the head end of the article to the tail end of the target sentence to obtain a first vector, and weighting and calculating a target vector corresponding to each sentence from the tail end of the target sentence to the tail end of the article to obtain a second vector; wherein the dimension of the first vector is equal to the dimension of the second vector;
carrying out similarity calculation on the first vector and the second vector corresponding to the target sentence, carrying out sigmoid nonlinear mapping of the calculated first similarity value onto the (0, 1) interval, and calculating the linear distance between the mapped value and 1;
and comparing the linear distance with a set threshold, and taking the tail position of the target sentence as an initial truncation point when the linear distance is higher than the set threshold.
Further, the step of calculating the similarity of the first vector and the second vector corresponding to the target sentence, performing sigmoid nonlinear mapping of the calculated first similarity value onto the (0, 1) interval, and finding the linear distance from 1 includes:
by the formula

S = Σᵢ(Aᵢ·Bᵢ) / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))

calculating the first similarity value, wherein S is the first similarity value, A represents the first vector, B represents the second vector, Aᵢ represents the i-th dimension of the first vector, and Bᵢ represents the i-th dimension of the second vector;
by the formula

σ(S) = 1 / (1 + e^(-S))

calculating the mapping value of the nonlinear mapping onto the (0, 1) interval;
and calculating the linear distance from 1 according to the mapping value, namely subtracting the mapping value from 1.
Further, after the step of comparing the linear distance with a set threshold and taking the end position of the target sentence as an initial truncation point when the linear distance is higher than the set threshold, the method further includes:
acquiring a first text distance from each initial truncation point to the head end of the article and a second text distance from each initial truncation point to the tail end of the article;
calculating, according to a preset formula, the position score of each initial truncation point, wherein K is the position score, X is the first text distance, and Y is the second text distance;
and selecting a preset number of target truncation points from the initial truncation points to truncate the article according to the first similarity value and the position score corresponding to each initial truncation point.
Further, the step of selecting a preset number of target truncation points from the initial truncation points to truncate the article according to the first similarity value and the position score corresponding to each initial truncation point includes:
recording a set formed by all the initial truncation points as a first set;
selecting, from the first set, sets each formed of the preset number of initial truncation points, and recording each such set as a second set;
calculating the score value of each second set through a calculation formula, wherein w and m are respectively preset weight parameters; h₁, h₂, …, hₙ are the first similarity values corresponding to the elements in the second set; ΔRᵢ is the difference between the first similarity values corresponding to the two elements selected from the second set as the i-th group; n represents the number of elements in the second set; and F(n) represents the score value;
and selecting the second set with the highest score value, and taking the initial truncation point in the set as the target truncation point.
Further, before the step of selecting a preset number of target truncation points from the initial truncation points to truncate the article according to the first similarity value and the position score corresponding to each initial truncation point, the method further includes:
splicing the first sentence vector of each sentence in the article to obtain an article vector of the article;
searching a preset list, according to the dimensionality of the article vector, for the preset number of target truncation points; wherein the preset list includes the corresponding relationship between the dimensionality of the article vector and the preset number of target truncation points.
Further, the step of inputting each sentence in the article into the BERT model to obtain a plurality of word vectors corresponding to each sentence, and inputting the word vectors into the bidirectional long-short term memory network in the form of a word vector sequence to obtain a first sentence vector and a second sentence vector corresponding to each sentence includes:
preprocessing the sentences, and establishing a TOKEN list according to the positions of the sentences in the article to record the positions of the sentences, wherein the preprocessing comprises removing punctuation marks in the sentences, unifying the language, and deleting irrelevant words and sentences, and the irrelevant words and sentences comprise greetings, adjectives and profanity;
reading the text data of a data set through the BERT model, and constructing the word vectors by fine-tuning the BERT model, wherein the BERT model is pre-trained on a corpus database;
and forming the word vector sequence by the word vectors according to the sequence in the sentence, sequentially splicing the word vectors according to the word vector sequence to form a first sentence vector, and sequentially splicing the word vectors in a reverse order to form a second sentence vector.
Further, the step of comparing the linear distance with a set threshold and taking the end position of the target sentence as an initial truncation point when the linear distance is higher than the set threshold comprises:
calculating a second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
extracting each initial truncation point whose second similarity value is smaller than a preset similarity value as a first truncation point;
and screening a target truncation point from the first truncation points according to a preset rule, and truncating the article at the target truncation point.
The invention also provides a device for setting an article truncation point, which comprises:
the vectorization module is used for inputting each sentence in the article into the BERT model to obtain a plurality of word vectors corresponding to each sentence, and inputting the word vectors into the bidirectional long-short term memory network in the form of a word vector sequence to obtain a first sentence vector and a second sentence vector corresponding to each sentence, wherein the first sentence vector is formed by splicing the word vector sequence in order, and the second sentence vector is formed by splicing the word vector sequence in reverse order;
the vector splicing module is used for splicing the tail end of the first sentence vector of each sentence with the head end of the second sentence vector to obtain a target vector of each sentence;
the weighting and calculating module is used for selecting a target sentence from the article, weighting and calculating a target vector corresponding to each sentence from the head end of the article to the tail end of the target sentence to obtain a first vector, and weighting and calculating a target vector corresponding to each sentence from the tail end of the target sentence to the tail end of the article to obtain a second vector; wherein the dimension of the first vector is equal to the dimension of the second vector;
the first similarity value calculation module is used for carrying out similarity calculation on the first vector and the second vector corresponding to the target sentence, carrying out sigmoid nonlinear mapping of the calculated first similarity value onto the (0, 1) interval, and calculating the linear distance between the mapped value and 1;
and the initial truncation point setting module is used for comparing the linear distance with a set threshold value, and when the linear distance is higher than the set threshold value, taking the tail position of the target sentence as the initial truncation point.
The invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method of any of the above.
The invention has the following beneficial effects: the target vectors corresponding to the sentences from the head end of the article to the tail end of the target sentence are weighted and combined to obtain a first vector, the target vectors corresponding to the sentences from the tail end of the target sentence to the tail end of the article are weighted and combined to obtain a second vector, and the similarity of the two is calculated; the information of all sentences is thereby fully considered, so that the truncation points of the article can be better selected.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all directional indicators (such as up, down, left, right, front, back, etc.) in the embodiments of the present invention are only used to explain the relative position relationship between the components, the motion situation, etc. in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indicator is changed accordingly, and the connection may be a direct connection or an indirect connection.
The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone.
In addition, descriptions such as "first", "second", etc. in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between the embodiments may be combined with each other, but must be based on the realization of the technical solutions by a person skilled in the art, and when the technical solutions are contradictory to each other or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a method for setting an article truncation point, including:
s1: inputting each sentence in an article into a BERT model to obtain a plurality of word vectors corresponding to each sentence, and inputting the word vectors, in the form of a word vector sequence, into a bidirectional long-short term memory network to obtain a first sentence vector and a second sentence vector corresponding to each sentence, wherein the first sentence vector is formed by splicing the word vector sequence in order, and the second sentence vector is formed by splicing the word vector sequence in reverse order;
s2: splicing the tail end of the first sentence vector of each sentence with the head end of the second sentence vector to obtain a target vector of each sentence;
s3: selecting a target sentence from the article, weighting and calculating a target vector corresponding to each sentence from the head end of the article to the tail end of the target sentence to obtain a first vector, and weighting and calculating a target vector corresponding to each sentence from the tail end of the target sentence to the tail end of the article to obtain a second vector; wherein the dimension of the first vector is equal to the dimension of the second vector;
s4: carrying out similarity calculation on the first vector and the second vector corresponding to the target sentence, carrying out sigmoid nonlinear mapping of the calculated first similarity value onto the (0, 1) interval, and calculating the linear distance between the mapped value and 1;
s5: and comparing the linear distance with a set threshold, and taking the tail position of the target sentence as an initial truncation point when the linear distance is higher than the set threshold.
As described in step S1, each sentence in the article is input into the BERT model to obtain a plurality of word vectors corresponding to each sentence. The division of the sentences in the article is performed by sentence division symbols: the content from the beginning of the article to the first sentence division symbol is one sentence, and the content between consecutive sentence division symbols is one sentence. The sentence division symbols may be Chinese or English sentence-ending symbols, such as a full stop, an exclamation mark, or a question mark. Different BERT models can be trained on corpus databases of different categories; the corresponding BERT model is then selected according to the category of the article to be input, and because that model is trained on a corpus database of the matching category, the word vectors it generates are better.
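Purely as an illustrative sketch of the sentence division described above (the exact delimiter set is an assumption; the text only names the full stop, exclamation mark and question mark):

```python
import re

# Split an article into sentences after each Chinese or English
# sentence-ending symbol (full stop, exclamation mark, question mark).
SENTENCE_DELIMITERS = r"(?<=[。！？.!?])"

def split_sentences(article):
    """Return the sentences of an article, dropping empty fragments."""
    parts = re.split(SENTENCE_DELIMITERS, article)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("First sentence. Second one! Third?")
print(sentences)  # ['First sentence.', 'Second one!', 'Third?']
```

Each sentence would then be passed to the selected BERT model to obtain its word vectors.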
As described in step S2 above, in order to better calculate the information included in each sentence, the first sentence vector formed by splicing the word vector sequence in order and the second sentence vector formed by splicing the word vector sequence in reverse order may be spliced to form a target vector; the target vector reduces the loss value of subsequent calculations, so that the result of the subsequent similarity calculation is better.
As described in step S3, a target sentence is selected; the target sentences are selected by taking each sentence in the article in turn. The target vectors corresponding to the sentences from the beginning of the article to the end of the target sentence are then weighted and combined to obtain a first vector, and the target vectors corresponding to the sentences from the end of the target sentence to the end of the article are weighted and combined to obtain a second vector. The weighted calculation includes performing dimension-raising or dimension-reducing calculation on the first vector and/or the second vector, in order to keep the dimensions of the first vector and the second vector consistent and thus facilitate the subsequent similarity calculation.
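The weighting scheme itself is left open in the description above; a minimal sketch, assuming uniform weights (a simple mean, which already keeps the first and second vectors at the same dimension as the target vectors), could look as follows:

```python
def weighted_mean(vectors, weights=None):
    """Combine equally-sized target vectors into one vector of the
    same dimension. Uniform weights are an assumption, since the
    text leaves the weighting scheme open."""
    if weights is None:
        weights = [1.0 / len(vectors)] * len(vectors)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

# Hypothetical target vectors for a 4-sentence article; target sentence index k = 1.
targets = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
k = 1
first_vector = weighted_mean(targets[:k + 1])   # head of article .. end of target sentence
second_vector = weighted_mean(targets[k + 1:])  # after target sentence .. end of article
print(first_vector, second_vector)  # [2.0, 3.0] [6.0, 7.0]
```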
As described in step S4, similarity calculation is performed on the first vector and the second vector. The similarity may be calculated by the WMD algorithm (Word Mover's Distance), the simhash algorithm, an algorithm based on cosine similarity, a calculation based on an SVM (Support Vector Machine) vector model, and the like. The calculated first similarity value is then mapped onto the (0, 1) interval, so that the similarity can be expressed as a linear distance from 1, which facilitates the subsequent comparison with the threshold.
As described in step S5 above, by comparing the linear distance with the set threshold it can be determined whether the end of each sentence satisfies the initial condition for segmentation. When the initial condition is satisfied, the end position of the corresponding target sentence can be used as an initial truncation point, and the initial truncation point can subsequently be used directly as a final truncation point to truncate the article. When a plurality of truncation points are obtained, one or more initial truncation points can be selected to truncate the article; the selection rule is not limited, for example, the initial truncation points that make the character counts of the resulting paragraphs differ as little as possible, or the initial truncation point with the smallest similarity, may be selected.
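The threshold comparison of step S5 can be sketched as follows (the distances and the threshold value are hypothetical):

```python
def initial_truncation_points(linear_distances, threshold):
    """Return the sentence indices whose linear distance from 1
    exceeds the set threshold; their tail positions become the
    initial truncation points."""
    return [i for i, d in enumerate(linear_distances) if d > threshold]

# Hypothetical per-sentence linear distances and a threshold of 0.5.
print(initial_truncation_points([0.2, 0.7, 0.4, 0.9], 0.5))  # [1, 3]
```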
In one embodiment, the step S4 of performing similarity calculation on the first vector and the second vector corresponding to the target sentence, performing sigmoid nonlinear mapping of the calculated first similarity value onto the (0, 1) interval, and calculating the linear distance from 1 includes:
s401: by the formula

S = Σᵢ(Aᵢ·Bᵢ) / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))

calculating the first similarity value, wherein S is the first similarity value, A represents the first vector, B represents the second vector, Aᵢ represents the i-th dimension of the first vector, and Bᵢ represents the i-th dimension of the second vector;
s402: by the formula

σ(S) = 1 / (1 + e^(-S))

calculating the mapping value of the nonlinear mapping onto the (0, 1) interval;
s403: and calculating the linear distance from 1 according to the mapping value, namely subtracting the mapping value from 1.
As described in steps S401 to S403, since the first vector and the second vector have the same dimension, each dimension may be calculated separately and the results then integrated to obtain the first similarity value; the similarity calculation uses as many input values as possible, which reduces the calculation loss of the function and improves the calculation effect. The mapping value of each first similarity value on the (0, 1) interval is calculated by the sigmoid function, and the linear distance from 1 is obtained by subtracting the mapping value from 1.
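A minimal sketch of steps S401-S403, assuming the dimension-wise similarity is the cosine similarity (one of the candidate measures named above):

```python
import math

def first_similarity(a, b):
    """Dimension-wise (cosine) similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def linear_distance_from_one(similarity):
    """Map the similarity onto (0, 1) with a sigmoid, then subtract from 1."""
    mapped = 1.0 / (1.0 + math.exp(-similarity))
    return 1.0 - mapped

s = first_similarity([1.0, 0.0], [1.0, 0.0])   # identical vectors -> similarity 1.0
print(round(s, 4), round(linear_distance_from_one(s), 4))  # 1.0 0.2689
```

A small linear distance then indicates that the text before and after the candidate point is closely related, so the point is a poor place to cut.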
In one embodiment, after the step S5 of comparing the linear distance with a set threshold and taking the end position of the target sentence as an initial truncation point when the linear distance is higher than the set threshold, the method further includes:
s601: acquiring a first text distance from each initial truncation point to the head end of the article and a second text distance from each initial truncation point to the tail end of the article;
s602: calculating, according to a preset formula, the position score of each initial truncation point, wherein K is the position score, X is the first text distance, and Y is the second text distance;
s603: and selecting a preset number of target truncation points from the initial truncation points to truncate the article according to the first similarity value and the position score corresponding to each initial truncation point.
As described in the above steps S601-S603, when there are a plurality of initial truncation points, the position of each truncation point in the article, i.e. the first text distance and the second text distance, may be considered, truncation near the center of the article being preferred; the position of each initial truncation point is therefore scored to obtain a position score. The position score of each initial truncation point is calculated according to the formula, a comprehensive calculation is then performed with the first similarity value, and a preset number of initial truncation points are selected as target truncation points.
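The position-score formula itself is not reproduced above; purely as a hypothetical illustration of scoring that favours the centre of the article, one possible form is K = 1 − |X − Y| / (X + Y), which equals 1 when the two text distances are equal:

```python
def position_score(x, y):
    """Hypothetical position score: 1.0 at the article centre (x == y),
    falling toward 0 at either end. The actual patented formula is not
    shown in the text above."""
    return 1.0 - abs(x - y) / (x + y)

# x = first text distance, y = second text distance (hypothetical values).
print(position_score(50, 50), round(position_score(90, 10), 2))  # 1.0 0.2
```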
In an embodiment, the step S603 of selecting a preset number of target truncation points from the initial truncation points to truncate the article according to the first similarity value and the position score corresponding to each of the initial truncation points includes:
s6031: recording a set formed by all the initial truncation points as a first set;
s6032: selecting, from the first set, sets each formed of the preset number of initial truncation points, and recording each such set as a second set;
s6033: calculating the score value of each second set through a calculation formula, wherein w and m are respectively preset weight parameters; h₁, h₂, …, hₙ are the first similarity values corresponding to the elements in the second set; ΔRᵢ is the difference between the first similarity values corresponding to the two elements selected from the second set as the i-th group; n represents the number of elements in the second set; and F(n) represents the score value;
and selecting the second set with the highest score value, and taking the initial truncation point in the set as the target truncation point.
As described in the above steps S6031-S6033, the set formed by the initial truncation points is recorded as a first set. When the article is long, the number of initial truncation points is large and the number of required target truncation points is correspondingly large, so different combinations of the required number of truncation points, i.e. the preset number, can be selected from the first set as second sets. The score of each second set is then calculated by the formula, with different weight coefficients w and m given to the first similarity values and to the differences between them.
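The exact score formula F(n) is likewise not reproduced above; the following sketch only illustrates the described mechanism, with an assumed combination in which w weights the similarity values h₁…hₙ and m weights the pairwise differences ΔRᵢ:

```python
from itertools import combinations

def set_score(h_values, w=1.0, m=0.5):
    """Hypothetical score F(n) for one second set: w weights the first
    similarity values h_1..h_n of its elements, m weights the pairwise
    differences delta_R_i. The actual patented formula is not shown above."""
    sim_term = w * sum(h_values)
    diff_term = m * sum(abs(a - b) for a, b in combinations(h_values, 2))
    return sim_term + diff_term

# First similarity values of all initial truncation points (the first set).
first_set = [0.9, 0.8, 0.4, 0.7]
# All candidate second sets of preset size 2, scored; keep the highest.
second_sets = list(combinations(first_set, 2))
best = max(second_sets, key=set_score)
print(best)  # (0.9, 0.8)
```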
In an embodiment, before the step S603 of selecting a preset number of target truncation points from the initial truncation points to truncate the article according to the first similarity value and the position score corresponding to each of the initial truncation points, the method further includes:
s6021: splicing the first sentence vector of each sentence in the article to obtain an article vector of the article;
s6022: searching a preset list, according to the dimensionality of the article vector, for the preset number of target truncation points; wherein the preset list includes the corresponding relationship between the dimensionality of the article vector and the preset number of target truncation points.
As described in the foregoing steps S6021-S6022, the first sentence vectors of the sentences in the article are spliced to obtain the article vector of the article; the preset number of target truncation points may then be looked up in a preset list according to the dimensionality of the article vector, the preset list recording the corresponding relationship between the preset number of target truncation points and the dimensionality of the article vector.
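A minimal sketch of the look-up in steps S6021-S6022; the dimensionalities and preset numbers in the table are hypothetical:

```python
# Hypothetical preset list mapping article-vector dimensionality
# to the preset number of target truncation points.
PRESET_LIST = {768: 1, 1536: 2, 3072: 4}

def preset_number(article_vector_dim, default=1):
    """Look up how many target truncation points to select for an
    article vector of the given dimensionality."""
    return PRESET_LIST.get(article_vector_dim, default)

print(preset_number(1536))  # 2
```

Since the article vector is the concatenation of per-sentence vectors, its dimensionality grows with article length, so longer articles map to more truncation points.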
In one embodiment, the step S1 of inputting each sentence in the article into the BERT model to obtain a plurality of word vectors corresponding to each sentence, and inputting the word vectors into the bidirectional long-short term memory network in the form of a word vector sequence to obtain a first sentence vector and a second sentence vector corresponding to each sentence includes:
s101: preprocessing the sentences, and establishing a TOKEN list according to the positions of the sentences in the article to record the positions of the sentences, wherein the preprocessing comprises removing punctuation marks in the sentences, unifying the language, and deleting irrelevant words and sentences, and the irrelevant words and sentences comprise greetings, adjectives and profanity;
s102: reading the text data of a data set through the BERT model, and constructing the word vectors by fine-tuning the BERT model, wherein the BERT model is pre-trained on a corpus database;
s103: and forming the word vector sequence by the word vectors according to the sequence in the sentence, sequentially splicing the word vectors according to the word vector sequence to form a first sentence vector, and sequentially splicing the word vectors in a reverse order to form a second sentence vector.
As described in steps S101-S103, in order to simplify the generated sentence vectors and discard irrelevant influencing factors, the sentences may be preprocessed: punctuation marks and irrelevant words and sentences are deleted and the language is unified. A TOKEN list is then established to mark each sentence, which facilitates the calculation of each sentence in subsequent processing and avoids confusion about where each sentence occurs. Word vectors are then constructed through the BERT model, and the word vectors are spliced in order and in reverse order according to the word vector sequence to form the first sentence vector and the second sentence vector.
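The preprocessing and TOKEN list of steps S101-S103 can be sketched as follows (the irrelevant-word list and the cleaning rules are assumptions; the actual BERT tokenization is not reproduced here):

```python
import re

# Hypothetical irrelevant words to delete; the text names greetings,
# adjectives and profanity but gives no concrete list.
IRRELEVANT = {"hello", "hi", "damn"}

def preprocess(sentences):
    """Strip punctuation, drop irrelevant words, and build a TOKEN
    list recording each sentence's position in the article."""
    token_list = []
    cleaned = []
    for pos, sentence in enumerate(sentences):
        words = re.sub(r"[^\w\s]", "", sentence.lower()).split()
        words = [w for w in words if w not in IRRELEVANT]
        token_list.append(pos)
        cleaned.append(" ".join(words))
    return cleaned, token_list

cleaned, tokens = preprocess(["Hello, nice day!", "The model works."])
print(cleaned, tokens)  # ['nice day', 'the model works'] [0, 1]
```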
In another embodiment, the step S5 of comparing the linear distance with a set threshold and taking the end position of the target sentence as an initial truncation point when the linear distance is higher than the set threshold includes:
s601: calculating a second similarity value of the target sentence vectors of the two sentences adjacent to each initial truncation point;
s602: extracting each initial truncation point whose second similarity value is smaller than a preset similarity value as a first truncation point;
s603: and screening a target truncation point from the first truncation points according to a preset rule, and truncating the article at the target truncation point.
As described in the foregoing steps S601-S603, a second similarity value of the target sentence vectors of the two adjacent sentences may also be calculated for further determination. When the linear distance is larger than the set threshold, the initial truncation point is obtained; the second similarity value of the two sentence vectors adjacent to that initial truncation point is then calculated, the initial truncation points whose second similarity value is smaller than the preset similarity value are extracted as first truncation points, and the article is then truncated by a preset rule, for example, selecting the first truncation point with the smallest second similarity value as the target truncation point, thereby completing the segmentation of the article.
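The adjacent-sentence check of steps S601-S602 can be sketched as follows, assuming cosine similarity as the second similarity measure and hypothetical two-dimensional sentence vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_first_truncation_points(points, sentence_vectors, preset_similarity):
    """Keep only the initial truncation points whose two adjacent
    sentence vectors are dissimilar enough, i.e. whose second
    similarity value is below the preset similarity value."""
    kept = []
    for p in points:
        second_similarity = cosine(sentence_vectors[p], sentence_vectors[p + 1])
        if second_similarity < preset_similarity:
            kept.append(p)
    return kept

# Hypothetical sentence vectors; truncation after sentence 1 survives
# the filter because sentences 1 and 2 point in different directions.
vecs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(filter_first_truncation_points([0, 1], vecs, 0.5))  # [1]
```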
Referring to fig. 2, the present invention further provides an article truncation point setting apparatus, including:
a vectorization module 10, configured to input each sentence in an article into a BERT model to obtain a plurality of word vectors corresponding to each sentence, and input the word vectors into a bidirectional long-short term memory network in the form of a word vector sequence to obtain a first sentence vector and a second sentence vector corresponding to each sentence, where the first sentence vector is formed by splicing the word vector sequence in order, and the second sentence vector is formed by splicing the word vector sequence in reverse order;
a vector splicing module 20, configured to splice the tail end of the first sentence vector of each sentence with the head end of the second sentence vector to obtain a target vector of each sentence;
a weighting and calculating module 30, configured to select a target sentence from the article, perform weighting and calculation on the target vectors corresponding to the sentences from the head end of the article to the tail end of the target sentence to obtain a first vector, and perform weighting and calculation on the target vectors corresponding to the sentences from the tail end of the target sentence to the tail end of the article to obtain a second vector; wherein the dimension of the first vector is equal to the dimension of the second vector;
the first similarity value calculation module 40, configured to perform similarity calculation on the first vector and the second vector corresponding to the target sentence, perform sigmoid nonlinear mapping of the calculated first similarity value onto the (0, 1) interval, and calculate the linear distance from 1;
and an initial truncation point setting module 50, configured to compare the linear distance with a set threshold, and when the linear distance is higher than the set threshold, take the tail position of the target sentence as an initial truncation point.
Each sentence in the article is input into the BERT model to obtain a plurality of word vectors corresponding to each sentence. The division of the sentences in the article is performed by sentence division symbols: the content from the beginning of the article to the first sentence division symbol is one sentence, and the content between consecutive sentence division symbols is one sentence. The sentence division symbols may be Chinese or English sentence-ending symbols, such as a full stop, an exclamation mark, or a question mark. Different BERT models can be trained on corpus databases of different categories; the corresponding BERT model is then selected according to the category of the article to be input, and because that model is trained on a corpus database of the matching category, the word vectors it generates are better.
In order to better calculate the information contained in each sentence, the first sentence vector formed by splicing the word vector sequence in order and the second sentence vector formed by splicing the word vector sequence in reverse order may be spliced to form a target vector; the target vector reduces the loss value of subsequent calculations, so that the result of the subsequent similarity calculation is better.
A target sentence is selected; the target sentences may be taken from the sentences of the article in turn. The target vectors corresponding to the sentences from the head end of the article to the tail end of the target sentence are then weighted and combined to obtain a first vector, and the target vectors corresponding to the sentences from the tail end of the target sentence to the tail end of the article are weighted and combined to obtain a second vector. The weighted calculation includes performing dimension-raising or dimension-reducing calculation on the first vector and/or the second vector, so as to keep the dimensions of the first vector and the second vector consistent and facilitate the subsequent similarity calculation.
Similarity calculation is carried out on the first vector and the second vector. The similarity may be calculated by the WMD algorithm (Word Mover's Distance), the simhash algorithm, an algorithm based on cosine similarity, a calculation based on an SVM (Support Vector Machine) vector model, and the like. The calculated first similarity value is then mapped onto the (0, 1) interval, so that the similarity can be expressed as a linear distance from 1, which facilitates the subsequent comparison with the threshold.
The linear distance is compared with a set threshold to determine whether the tail end of each sentence satisfies the initial condition for segmentation. When the initial condition is satisfied, the tail end position of the corresponding target sentence is taken as an initial truncation point, which may then be used directly as the final truncation point to truncate the article. When a plurality of truncation points are included, one or more initial truncation points are selected to truncate the article; the selection rule is not limited. For example, the initial truncation points may be chosen so that the word counts of the resulting paragraphs differ as little as possible, or the initial truncation point with the minimum similarity among the candidates may be selected.
In one embodiment, the first similarity value calculation module 40 includes:
a first similarity value calculation submodule, configured to calculate the first similarity value through a formula that operates on the first vector, the second vector, the ith dimension of the first vector, and the ith dimension of the second vector, and outputs the first similarity value;
a mapping value calculation submodule, configured to calculate the mapping value of the nonlinear mapping of the first similarity value into the (0, 1) interval, the mapping being a sigmoid, i.e., σ(x) = 1/(1 + e^(-x));
and a linear distance calculation submodule, configured to obtain the linear distance to 1 from the mapping value.
Because the dimensionalities of the first vector and the second vector are the same, each dimension can be calculated independently and the results then integrated to obtain the first similarity value. This similarity calculation uses as much of the input as possible, which reduces the calculation loss of the function and yields a better result. The mapping value of each first similarity value in the (0, 1) interval is then calculated through a sigmoid function, and finally the linear distance between the mapping value and 1 is obtained by subtracting the mapping value from 1.
In one embodiment, the device for setting the article interception point further comprises:
the text distance acquisition module is used for acquiring a first text distance from each initial interception point to the head end of the article and a second text distance from each initial interception point to the tail end of the article;
a position score calculation module for calculating, according to a formula, the position score of each initial interception point, wherein K is the position score, X is the first text distance, and Y is the second text distance;
and the target truncation point selecting module is used for selecting a preset number of target truncation points from the initial truncation points to truncate the article according to the first similarity value and the position score corresponding to each initial truncation point.
When there are a plurality of initial truncation points, the position of each truncation point in the article, i.e., the first text distance and the second text distance, may be considered, truncation preferably being performed near the center of the article. The position of each initial truncation point can therefore be scored, i.e., given a position score calculated according to the formula. A comprehensive calculation is then performed from the position score and the first similarity value, and a preset number of initial truncation points are selected as target truncation points.
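The position-score formula is referred to only symbolically above. One center-favoring score consistent with the description, offered here purely as a hypothetical stand-in rather than the method's actual formula, is the ratio of the shorter text distance to the longer one:

```python
def position_score(x, y):
    """Hypothetical position score K for text distances X and Y:
    the ratio of the shorter distance to the longer one, which
    peaks at 1.0 when the truncation point lies exactly at the
    article's center and falls toward 0 near either end."""
    return min(x, y) / max(x, y)
```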
In one embodiment, the target truncation point selection module includes:
a first set forming submodule, configured to mark a set formed by all the initial truncation points as a first set;
a second set forming submodule, configured to select a set formed by the preset number of initial truncation points from the first set, and record the set as a second set;
the score value calculation submodule is used for calculating the score value of each second set through a calculation formula, wherein w and m are respectively preset weight parameters; h1, h2, …, hn are the first similarity values corresponding to the elements in the second set; ΔRi is the difference between the first similarity values corresponding to the two elements selected from the second set for the ith group; n represents the number of elements in the second set; and F(n) represents the score value;
and the selection submodule is used for selecting the second set with the highest score value and taking the initial interception point in the set as the target interception point.
The set formed by all the initial truncation points is recorded as a first set. When the article is long, there are many initial truncation points, and the number of required target truncation points is correspondingly large, so different combinations of the preset number of initial truncation points are screened from the first set as second sets. The score value of each second set is then calculated through the formula, with different weight coefficients w and m assigned to the position score and the first similarity value: it can be understood that when the position score should carry more influence, the weight coefficient w can be increased, and when the first similarity should carry more influence, the weight coefficient m can be increased. The score value of each second set is calculated in this way, and the target truncation points are selected from the set with the highest score.
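The screening above can be sketched by exhaustive enumeration. The scoring function used here (reward position score, penalize first similarity, weighted by w and m) is an assumed stand-in for the formula F(n), which is not reproduced in this text:

```python
from itertools import combinations

def best_second_set(candidates, preset_n, w=1.0, m=1.0):
    """Enumerate every second set (subset of the preset number)
    drawn from the first set and return the highest-scoring one.
    `candidates` is a list of (position_score, first_similarity)
    pairs, one per initial truncation point."""
    def score(subset):
        # Hypothetical scoring: high position score and low
        # similarity both favor selection.
        return sum(w * k - m * h for k, h in subset)
    return max(combinations(candidates, preset_n), key=score)
```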
In one embodiment, the first similarity value calculation module 40 further includes:
the article vector splicing submodule is used for splicing the first sentence vector of each sentence in the article to obtain an article vector of the article;
a target interception point searching submodule, configured to search the preset number of corresponding target interception points in a preset list according to the dimension of the article vector; wherein the preset list includes a corresponding relationship between the dimensionality of the article vector and the preset number of the target interception points.
The first sentence vectors of the sentences in the article are spliced to obtain the article vector of the article. The preset number of target interception points can then be queried in a preset list according to the dimension of the article vector, wherein the preset list stores the preset correspondence between the dimension of the article vector and the preset number of target interception points.
In one embodiment, the vectorization module 10 includes:
the preprocessing submodule is used for preprocessing the sentences and establishing a TOKEN list according to the positions of the sentences in the article to record those positions, wherein the preprocessing includes removing punctuation from the sentences, unifying the language, and deleting irrelevant words and phrases, the irrelevant words and phrases including greetings, adjectives, and profanity;
the word vector reading submodule is used for reading the text data of a data set through the bert model and constructing the word vectors through the bert model in a fine-tuning manner, wherein the bert model is trained on the basis of a word database;
and the word vector sequence forming module is used for forming the word vector sequence by the word vectors according to the sequence in the sentence, sequentially splicing the word vectors according to the word vector sequence to form a first sentence vector, and sequentially splicing the word vectors in a reverse order to form a second sentence vector.
To simplify the generated sentence vectors and discard irrelevant influencing factors, the sentences can be preprocessed: punctuation marks and irrelevant words are deleted, the language is unified, and a TOKEN list is then established. Word vectors are then constructed through the bert model and, according to the word vector sequence, spliced sequentially to form the first sentence vector and spliced in reverse order to form the second sentence vector.
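A minimal sketch of the preprocessing and TOKEN-list step. The stop list of irrelevant words is an illustrative assumption; in practice it would cover greetings, adjectives, and profanity as described above:

```python
import re

def preprocess(sentences):
    """Strip punctuation, drop an (assumed, illustrative) list of
    irrelevant words, and build a TOKEN list recording each
    sentence's position in the article."""
    irrelevant = {"hello", "hi", "please"}   # illustrative stop list
    token_list, cleaned = [], []
    for pos, s in enumerate(sentences):
        s = re.sub(r"[^\w\s]", "", s)        # remove punctuation
        words = [w for w in s.split() if w.lower() not in irrelevant]
        cleaned.append(" ".join(words))
        token_list.append({"position": pos, "sentence": cleaned[-1]})
    return cleaned, token_list
```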
In another embodiment, the article interception point setting device comprises:
the second similarity value calculation module is used for calculating a second similarity value between the target sentence vectors of the two sentences adjacent to each initial truncation point;
the second similarity value judgment module is used for extracting the initial truncation points whose second similarity value is smaller than a preset similarity value as first truncation points;
and the target truncation point screening module is used for screening a target truncation point from the first truncation points according to a preset rule and truncating the article at the target truncation point.
A further judgment can also be made by calculating a second similarity value between the target sentence vectors of the two sentences adjacent to each initial truncation point. When the linear distance at an initial truncation point exceeds the set threshold, the second similarity value of the two adjacent sentence vectors is calculated; the initial truncation points whose second similarity value is smaller than a preset similarity value are then extracted as first truncation points; and the article is segmented according to a preset rule, for example by selecting the first truncation point with the minimum second similarity as the target truncation point, thereby completing the segmentation of the article.
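The two-stage screening just described can be sketched as follows, using the example rule of picking the point with the minimum second similarity (function and parameter names are illustrative):

```python
def select_target_point(initial_points, adjacent_sims, preset_sim):
    """Keep the initial truncation points whose adjacent-sentence
    (second) similarity is below the preset value -- the first
    truncation points -- then return the one with the minimum
    second similarity as the target truncation point."""
    firsts = [(p, s) for p, s in zip(initial_points, adjacent_sims)
              if s < preset_sim]
    if not firsts:
        return None   # no point passes the second-stage filter
    return min(firsts, key=lambda ps: ps[1])[0]
```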
The invention has the beneficial effects that: the method comprises the steps of weighting and calculating a target vector corresponding to each sentence from the head end of an article to the tail end of a target sentence to obtain a first vector, weighting and calculating a target vector corresponding to each sentence from the tail end of the target sentence to the tail end of the article to obtain a second vector, and calculating the similarity, wherein the information of all sentences is fully considered, and the cut-off points of the article can be selected better.
Referring to fig. 3, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is used to provide computation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing various word vectors and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. When executed by the processor, the computer program can implement the method for setting an article interception point according to any of the above embodiments.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is only a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for setting an article interception point according to any of the above embodiments can be implemented.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing relevant hardware; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided herein and used in the examples may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in the process, apparatus, article, or method that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.