Disclosure of Invention
Technical problem: based on an artificial intelligence algorithm, a method is provided for completing missing characters in ancient texts (cultural relics), combining emotion recognition, contextual semantics, and ancient pronunciation.
Technical solution: the invention provides a method for completing missing characters in ancient texts, which comprises the following steps:
Step 1, constructing an ancient-text data set;
Step 2, constructing a missing-character recognition model for predicting the characters missing from an ancient text;
The missing-character recognition model comprises an emotion recognition model, a semantic recognition model, a phonetic recognition model, and a Transformer encoder;
The output ends of the emotion recognition model, the semantic recognition model, and the phonetic recognition model are connected to the Transformer encoder;
Step 3, training the emotion recognition model, the semantic recognition model, and the phonetic recognition model with the ancient-text data set, and then training the missing-character recognition model as a whole with the same data set;
Step 4, inputting an ancient-text sentence containing a missing character into the trained missing-character recognition model and predicting the missing character.
Further, in step 3 the emotion recognition model is trained with the ancient-text data set; specifically, an ancient-text sentence containing a missing character is input into the emotion recognition model for emotion recognition, and the emotional tendency of the sentence is output.
Further, inputting the ancient-text sentence containing the missing character into the emotion recognition model for emotion recognition comprises the following steps:
Step 311, denote the text to the left of the missing character as Senleft and the text to the right as Senright, and apply the Emb(·) text-encoding operation to Senleft and Senright respectively, obtaining two encoding tensors left and right:
left=Emb(Senleft)
right=Emb(Senright)
Step 312, input the encoding tensors left and right into the bidirectional long short-term memory network Bi_LSTM(·) for feature extraction, obtaining:
outl=Bi_LSTM(left)
outr=Bi_LSTM(right)
Splice outl and outr and pass the result through a Softmax activation function to obtain the emotional tendency of the sentence containing the missing character:
emotion=Softmax(Cat(outl,outr))
where Cat(·) denotes the splicing operation on the two feature vectors and Softmax(·) is the activation function used for the final classification.
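As an illustration only (not part of the claimed method), the following PyTorch sketch shows one possible realization of steps 311-312 and the Softmax output; the vocabulary size, embedding and hidden dimensions, the use of final hidden states, and the 17-character contexts are assumptions of the example.

```python
import torch
import torch.nn as nn

class EmotionBranch(nn.Module):
    """Minimal sketch of the emotion branch: Emb -> Bi_LSTM (left/right) -> Cat -> Softmax."""
    def __init__(self, vocab_size=4000, emb_dim=64, hidden=128, n_classes=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)            # Emb(.)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(4 * hidden, n_classes)              # 2 directions x 2 branches

    def forward(self, sen_left, sen_right):
        left = self.emb(sen_left)                               # left = Emb(Senleft)
        right = self.emb(sen_right)                             # right = Emb(Senright)
        _, (h_l, _) = self.lstm(left)                           # outl = Bi_LSTM(left)
        _, (h_r, _) = self.lstm(right)                          # outr = Bi_LSTM(right)
        outl = torch.cat([h_l[0], h_l[1]], dim=-1)              # join fwd/bwd final states
        outr = torch.cat([h_r[0], h_r[1]], dim=-1)
        logits = self.fc(torch.cat([outl, outr], dim=-1))       # Cat(outl, outr)
        return torch.softmax(logits, dim=-1)                    # emotion = Softmax(...)

# usage: an 8-way emotion distribution for a batch of left/right contexts
model = EmotionBranch()
left_ids = torch.randint(0, 4000, (2, 17))    # 17 characters left of the gap
right_ids = torch.randint(0, 4000, (2, 17))   # 17 characters right of the gap
print(model(left_ids, right_ids).shape)       # torch.Size([2, 8])
```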
Further, in step 3 the semantic recognition model is trained with the ancient-text data set; specifically, the ancient-text sentence containing the missing character is input into the semantic recognition model, semantic recognition is performed, and the semantic vector semantic of the missing character within the sentence is output.
Further, the semantic recognition model adopts a bidirectional LSTM model.
Further, in step 3 the phonetic recognition model is trained with the ancient-text data set; specifically, the ancient-text sentence containing the missing character is input into the phonetic recognition model, pinyin recognition is performed, and the toned pinyin of the missing character is output.
Further, inputting the ancient-text sentence containing the missing character into the phonetic recognition model for pinyin recognition comprises the following steps:
Step 321, apply Word2Vec word-vector encoding together with a Huffman-tree algorithm to perform pinyin recognition on the sentence containing the missing character, obtaining the pinyin of the missing character; this pinyin carries no tone.
Step 322, encode the input sentence Sen with the Embedding word-encoding operation to obtain the encoding vector Word_emb:
Word_emb=Emb(Sen)
Step 323, input the encoding vector Word_emb into the bidirectional long short-term memory network Bi_LSTM for feature extraction, obtaining the feature vector Temp; the number of LSTM units in Bi_LSTM is set to 7 or 17:
Temp=Bi_LSTM(Word_emb)
Step 324, feed the extracted feature vector Temp into a Transformer network to extract global information and output the tone of the missing character:
tone=Transformer_Layer(Temp)
Step 325, combine the tone of the missing character with its pinyin to obtain the toned pinyin Pinyin of the missing character.
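Purely as an illustrative sketch of steps 322-325 (all layer sizes, the 5-class tone inventory with 0 as "unknown", the mean pooling, and the placeholder pinyin "han" are assumptions of the example, not claimed parameters):

```python
import torch
import torch.nn as nn

class ToneBranch(nn.Module):
    """Minimal sketch of steps 322-324: Emb -> Bi_LSTM -> Transformer layer -> tone."""
    def __init__(self, vocab_size=4000, emb_dim=64, hidden=64, n_tones=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)               # Word_emb = Emb(Sen)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.trans = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4,
                                                batch_first=True)  # Transformer_Layer
        self.fc = nn.Linear(2 * hidden, n_tones)   # 0 = unknown, 1-4 = the four tones

    def forward(self, sen):
        word_emb = self.emb(sen)
        temp, _ = self.lstm(word_emb)        # Temp = Bi_LSTM(Word_emb)
        temp = self.trans(temp)              # tone = Transformer_Layer(Temp)
        return self.fc(temp.mean(dim=1))     # pooled tone logits for the sentence

tone_id = ToneBranch()(torch.randint(0, 4000, (1, 17))).argmax(-1).item()
pinyin_with_tone = f"han{tone_id}"           # step 325: attach the tone to toneless pinyin
```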
Further, in step 3 the overall training of the missing-character recognition model specifically consists in feeding the emotional tendency, the semantic vector, and the toned pinyin of the missing character, output correspondingly by the emotion, semantic, and phonetic recognition models, into the Transformer encoder, thereby predicting the missing character.
Further, inputting the emotional tendency emotion, the semantic vector semantic, and the toned pinyin Pinyin output correspondingly by the emotion, semantic, and phonetic recognition models into the Transformer encoder to predict the missing character comprises the following steps:
Step 331, feed the semantic vector semantic, the emotional tendency emotion, and the toned pinyin Pinyin separately into an embedding layer for encoding; the resulting encoding vectors are expressed as follows:
Word_emb1=Emb(semantic)
Word_emb2=Emb(emotion)
Word_emb3=Emb(Pinyin)
Step 332, splicing the encoded vectors obtained in step 331, which is expressed as follows:
input=Cat(Word_emb1,Word_emb2,Word_emb3)
where Cat(·) denotes the splicing operation on the feature vectors;
Step 333, send the fused tensor input into the Transformer encoder for feature extraction, predict the missing character, and output:
Output=Transformer(input)
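The following sketch illustrates one possible reading of steps 331-333, with the splicing realized as a three-token sequence; the vocabulary size, the toned-pinyin inventory size, the linear projection used for the real-valued semantic vector, the encoder depth, and the output head are all assumptions of the example.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Minimal sketch of steps 331-333: embed the three branch outputs,
    splice them, and score candidate characters with a Transformer encoder."""
    def __init__(self, n_emotions=8, n_pinyins=1500, sem_dim=128, d_model=128, vocab=4000):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, d_model)        # Word_emb1 = Emb(semantic)
        self.emo_emb = nn.Embedding(n_emotions, d_model)   # Word_emb2 = Emb(emotion)
        self.pin_emb = nn.Embedding(n_pinyins, d_model)    # Word_emb3 = Emb(Pinyin)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)               # scores over candidate characters

    def forward(self, semantic, emotion_id, pinyin_id):
        tokens = torch.stack([self.sem_proj(semantic),     # input = Cat(Word_emb1..3),
                              self.emo_emb(emotion_id),    # realized here as a
                              self.pin_emb(pinyin_id)],    # 3-token sequence
                             dim=1)
        encoded = self.encoder(tokens)                     # Output = Transformer(input)
        return self.out(encoded.mean(dim=1))

head = FusionHead()
scores = head(torch.randn(1, 128), torch.tensor([6]), torch.tensor([42]))
print(scores.shape)   # torch.Size([1, 4000]) - one score per candidate character
```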
Beneficial effects: in the process of predicting missing characters, the invention extracts and fuses emotion, semantics, and phonetics, thereby improving both the efficiency and the quality of missing-character completion.
Detailed Description
Existing work on missing-character completion concentrates mainly on semantic features. However, many literary works in ancient Chinese were written by their authors under particular emotional states, so emotion is a feature carrying important information; moreover, ancient Chinese follows extensive writing conventions of phonetic antithesis and level-oblique (ping-ze) tonal patterns, so phonetics is likewise an important information feature. Current research neither extracts the emotional and phonetic features nor fuses them with the semantic features, so a great deal of feature information is lost and completion quality is reduced. Extracting and fusing the 3 feature dimensions of emotion, phonetics, and semantics can therefore improve the breadth and accuracy of the completion, is convenient for philologists to operate, and has both innovative and practical significance. In addition, the invention presents the individual results of the emotion, semantic, and phonetic analyses, together with the final result, to expert users, who can analyze and adopt them from multiple angles.
The theoretical basis of the invention rests on 3 starting points:
(1) Semantics
Chinese characters are ideographs: each character can express various meanings depending on its context, so understanding a character requires combining it with its context, and an intermediate character can be guessed from the surrounding characters. A Markov chain of local context information therefore exists over character meanings.
(2) Phonetics
Because Chinese characters include many polyphones, and the ancients attended to the rules of level-oblique (ping-ze) tonal patterns and rhyme when writing, pronunciation supplies information for guessing a missing character beyond the surrounding characters themselves. A Markov chain of local context information likewise exists over character pronunciations.
(3) Emotion
A literary work carries emotion: the author's feeling toward a certain situation influences the wording of the entire work, so emotion is a globally acting feature.
Based on these 3 factors, the invention provides a deep learning method for the ancient-text missing-character completion task that adopts a 2-level, multi-branch structure. The first level predicts the emotion, the semantics, and the toned pinyin of the character to be completed; toned pinyin refers to the first, second, third, or fourth tone in pinyin. The second level fuses emotion, semantics, and toned pinyin. After fusion, a group of candidate characters is output for philologists to consult.
Because expert users need to see the result of each angle's independent analysis in actual work, which facilitates their final comprehensive analysis, the models for these angles must be able to analyze independently. For this reason the overall model designed by the invention requires that the 3 angles can be learned independently, analyzed separately, and output separately; among the three fusion schemes of the data-fusion field (pixel level, feature level, and result level), the result-level fusion scheme is therefore selected. The method for completing missing characters in ancient texts comprises the following main stages: data set construction, learning process, use process, and feedback process.
Step 1, data set construction
Step 1.1, constructing the ancient-text data set
The public ancient-text electronic dataset chinese-poetry (https://github.com/chinese-poetry/chinese-poetry), disclosed on GitHub, is used; it can also be downloaded from, for example, the CSDN website. The dataset includes the Analects, the Classic of Poetry, the Four Books and Five Classics, Mengxue primers, 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci. Each sentence carries punctuation marks added by later scholars, and each punctuation mark is treated as a character.
In addition, the ancient-text data set includes pronunciation data corresponding to the texts. Considering that publicly available commercial text-to-speech software has fully learned the pronunciation rules of polyphones, the pronunciations of the ancient texts are generated automatically by computer in Mandarin: such software reads the sentences of the chinese-poetry dataset aloud to form a set of pronunciations. The pronunciation data can be stored in the ancient-text data set, or sound data in a lossless or MP3 format can be generated with the software each time a sentence in the data set is used.
The ancient texts are stored in a database in encoded form. The encoding standard is GB18030-2022, the upgrade of GB18030 issued in 2022; this national standard for Chinese character encoding extends GBK encoding, covers Chinese, Japanese, and Korean characters, and records more than eighty thousand Chinese characters.
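For illustration, Python's built-in gb18030 codec (which implements the GB18030 encoding family that GB18030-2022 extends) shows the byte-level form in which such characters would be stored; the sample characters are taken from the poem used in the examples below:

```python
# Illustrative only: print the GB18030 byte sequence of each character.
for ch in "锄禾日当午":
    print(ch, ch.encode("gb18030").hex())
```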
When each model is trained independently, characters present in the chinese-poetry dataset are hidden at random, and the emotion value, semantic value, pinyin value, and level-oblique tone that originally belonged to each hidden character are used as the learning outputs, forming paired training data.
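A minimal sketch of this pairing scheme, assuming a "□" mask symbol (in the real pipeline the hidden character's emotion, semantic, pinyin, and tone labels would be attached as targets):

```python
import random

def make_training_pair(sentence, mask_token="□"):
    """Hide one character at random; the original character is the learning target."""
    i = random.randrange(len(sentence))
    masked = sentence[:i] + mask_token + sentence[i + 1:]
    return masked, sentence[i], i   # masked sentence, target character, gap position

print(make_training_pair("锄禾日当午，汗滴禾下土。"))
```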
Step 2, learning process
Principles:
The existing natural language processing (NLP) field has a common shortcoming in the English application direction: the two sentences of a contradictory pair do not necessarily come from different semantic categories. Merely optimizing the entailment and contradiction objective functions of inference therefore does not adequately capture higher-order semantic features; that is, finer-grained semantics cannot be represented. This shortcoming arises because local, i.e. finer-grained, losses can only be learned from a single sentence pair or triplet, which yields a poor local optimum.
The principle of this part combines the two characteristics: if the model attends only to low-order features within a local range, their contradictory information is easily amplified and the model falls into a local optimum; if the model attends only to high-order features at the global scale, much detail information is lost and the optimal solution cannot be found. The two should therefore be combined, with the high-order features guiding the extraction and learning of the low-order features, so as to achieve a better balance.
Since the principles of natural language processing are broadly the same, the same working principle holds for the ancient texts addressed by the invention. Considering the multi-dimensional nature of the features, the invention proposes using emotion, a high-order feature, for feature extraction and learning together with semantics and phonetics, two low-order features.
In addition, the invention considers that expert users need to see the result of each angle's independent analysis in actual work, which facilitates their final comprehensive analysis, so the models for these angles must be able to analyze independently. In view of this, as shown in FIG. 1, the overall model designed by the invention requires that these 3 angles can be analyzed and learned independently; the result-level fusion scheme is therefore selected from the three schemes of the data-fusion field (pixel level, feature level, and result level).
In the overall structure, as shown in FIG. 1, the proposed model is divided into 2 stages: stage 1 performs emotion recognition, semantic recognition, and phonetic recognition; stage 2 performs a Transformer-based fusion operation on the prediction results of these high- and low-order features. Because emotional features are global high-order features while contextual features are local low-order features, the invention extracts and learns from the multiple features of emotion, semantics, and phonetics in the hope of achieving a better balance.
Emotion recognition is a specialized field with dedicated data sets and algorithms; the invention uses an existing mature algorithm.
Step 2.1, emotion recognition model
The emotion recognition model adopts the bidirectional LSTM recognition scheme from Leng Yongcai's 2021 master's thesis at North China Electric Power University, "Research on short-text sentiment analysis algorithms based on deep learning", to recognize emotion in Chinese text. On this basis, the invention defines the recognized emotion values as 8 classes: 0 (unknown), 1 (very negative), 2 (negative), 3 (slightly negative), 4 (neutral), 5 (slightly positive), 6 (positive), 7 (very positive). The values 0 to 7 are passed to the next level and used as high-order semantics to guide the learning of the low-order semantics.
Analysis of the chinese-poetry dataset shows that a reasonable balance between sentence length and emotion accuracy lies at 28-35 characters, so the number of LSTM units in the bidirectional LSTM of the emotion recognition model is set to 35, i.e. 17 characters before and after the character to be recognized (if a character near the beginning or end of a text has fewer than 17 characters on one side, that side is padded with 0; punctuation marks are also input as characters). This is one of the parameter settings in assembling the emotion model. The invention thus takes 35 characters as the length of one sentence, recognizes the overall emotion of the 35-character window centered on each character, and assigns that overall emotion value as the emotional feature of the character.
Using the emotion recognition data set of the thesis above as input, a suitable emotion recognition model M1 can be trained; M1 is then used to perform emotion recognition on input text data TexT1, yielding the output of the emotion recognition model, the emotion result OutT1.
The invention treats the character to be completed as a blank value whose emotional expression value is computed from the 34 surrounding values and filled into the emotion result sequence OutT1 of model M1. Directly using the 34 characters before and after the missing character to recognize its emotion would be computationally expensive, so the invention adopts the following two-stage scheme.
Step 2.1-A: for the 34 characters before and after the character to be completed, each character together with the character immediately before it and the character immediately after it forms a 3-character phrase, and the emotion value of each of the 34 characters is recognized from this 1-character neighbourhood (head and tail characters use the following 2 or the preceding 2 characters instead);
Step 2.1-B: the emotion-value sequence OutT1 of the 34 characters is then given to the network, which recognizes the emotion value of the missing character to be completed. With this 2-stage scheme the computation can be greatly reduced (see the sketch after this step).
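A minimal sketch of the two-stage windowing of steps 2.1-A and 2.1-B (the padding conventions are assumptions of the example):

```python
def three_char_windows(text):
    """Step 2.1-A sketch: each character with its immediate neighbours forms a
    3-character phrase; head/tail characters borrow the following/preceding 2."""
    windows = []
    for i in range(len(text)):
        if i == 0:
            windows.append(text[0:3])
        elif i == len(text) - 1:
            windows.append(text[-3:])
        else:
            windows.append(text[i - 1:i + 2])
    return windows

def window_35(emotion_values, center, pad=0):
    """Step 2.1-B sketch: 17 emotion values on each side of the gap,
    padded with 0 ('unknown') where the context is shorter."""
    left = emotion_values[max(0, center - 17):center]
    right = emotion_values[center + 1:center + 18]
    left = [pad] * (17 - len(left)) + left
    right = right + [pad] * (17 - len(right))
    return left + [pad] + right   # length 35 with the gap in the middle

print(three_char_windows("锄禾日当午"))   # ['锄禾日', '锄禾日', '禾日当', '日当午', '日当午']
```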
The output of emotion recognition model M1 is a sequence of emotional expression values OutT1 = {Oa1_1, Oa1_2, …, Oa1_i, …, Oa1_LT1}, whose length matches the text length LT1 of the input TexT1, where i = 1 to LT1 and Oa1_i denotes the result for the i-th character; LT1 = 35 is adopted in the embodiment of the invention. For example, Oa1_2 is the emotion result of the 2nd character of the text, taking a value between 0 and 7 for the 8 classes 0 (unknown), 1 (very negative), 2 (negative), 3 (slightly negative), 4 (neutral), 5 (slightly positive), 6 (positive), and 7 (very positive).
With this arrangement, the output OutT1 is produced by forming a sentence from the surrounding characters centered on each character, recognizing the overall emotion of that sentence, and assigning the overall emotion value to the position of the character, thereby forming a high-order emotional feature sequence.
The specific bidirectional LSTM scheme of emotion recognition model M1 is as follows. Senleft and Senright are the texts to the left and right of the missing character, and Emb(·) is the Embedding text-encoding operation shown in FIG. 2 (see the paper "Efficient Estimation of Word Representations in Vector Space"): a text-encoding mode that represents text with low-dimensional vectors optimized through neural-network training, so that the relevance between characters can be expressed. Embedding commonly takes two parameters: the first is the maximum dictionary size, and the second is the dimension of the desired output vector.
In a specific embodiment, for step 2.1-A, the line "锄禾日当午" is split character by character into "锄", "禾", "日", "当", "午". Each character is sent to the embedding layer for encoding with a specified output dimension of 3, and after the model is trained a real-valued vector is obtained for each character; for example, the vector encoding "锄" might be [0.2, 0.4, -0.1]. A low-dimensional vector representation of the text is thus obtained, with the specific dimension set in advance.
left=Emb(Senleft)
right=Emb(Senright)
Since this step becomes computationally heavy as the dimension grows, and the computational requirements at 35 dimensions are high, the invention splits the work into 2 steps: step 2.1-A uses only 3 characters to extract the low-dimensional vector representation of each character, and step 2.1-B operates directly on the OutT1 obtained from step 2.1-A, thereby reducing the computation.
Here the left and right texts Senleft and Senright of the missing character are encoded to obtain the two tensors left and right.
The bidirectional LSTM in emotion recognition model M1 is Bi_LSTM, a bidirectional long short-term memory network formed by combining a forward LSTM and a backward LSTM. The LSTM units required in the two steps differ: step 2.1-A uses 3 LSTM units, whose inputs are the two tensors left and right, while step 2.1-B uses 35 LSTM units, whose input is OutT1.
The forward and backward directions are typically used to model context information in natural language processing tasks (for LSTM-based sequence modelling see, e.g., the paper "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting"). The encoding tensors of the texts to the left and right of the missing character, obtained from the encoding operation, each undergo feature extraction through the bidirectional long short-term memory network Bi_LSTM(·): once an encoding tensor is input into Bi_LSTM(·), the forward LSTM reads the characters from front to back and the backward LSTM reads them from back to front. The bidirectional network comprises multiple LSTM units, each maintaining a hidden state that represents its understanding of the current context; finally the hidden states of the two directions are combined into a comprehensive understanding of the context of the whole sentence.
outl=Bi_LSTM(left)
outr=Bi_LSTM(right)
After the feature-extraction operations of the bidirectional networks of the left and right branches, the obtained feature vectors are fused, and the emotional tendency of the sentence is then obtained through the Softmax activation function output:
emotion=Softmax(Cat(outl,outr))
where Cat(·) denotes the concatenation of two feature vectors, and Softmax(·) is the activation function for the final classification: it scores the preceding linear classification, scaling each element of the output vector to a value between 0 and 1 with all values summing to 1. The model finally outputs the position index of the largest probability value, i.e. the predicted emotion class. In step 2.1-A, the emotions of the individual characters, each obtained from its 3-character phrase, are written one by one into OutT1; in step 2.1-B, the emotion of the missing value is determined from the 35 characters (fewer where the actual context is shorter).
Take "锄禾日当午，汗滴禾下土。谁知盘中餐，粒粒皆辛苦。" ("Hoeing grain at high noon, sweat drips into the soil below; who knows that the food on the plate, every grain, is hard-won.") as an example, and assume the missing character is "汗". The scheme proceeds as follows:
In step 2.1-A, the 3-character phrases "锄禾日", "禾日当", "日当午", … are extracted and recognized one by one using a Bi_LSTM of 3 LSTM units, so that every character in the whole text receives one emotion value, yielding OutT1;
In step 2.1-B, a Bi_LSTM of 35 LSTM units recognizes the input OutT1 and obtains the emotion value of the missing value. In this example, only 6 of the 17 LSTM units on the left are used, corresponding to the emotion values of the 6 characters of "锄禾日当午，"; the remaining 11 positions are padded with 0 ("unknown"). All 17 units on the right are used, corresponding to the emotion values of the 17 characters of "滴禾下土。谁知盘中餐，粒粒皆辛苦。".
Therefore, under a limited computation budget, the invention splits the work into 2 main steps based on Bi_LSTM structures (the 3-unit and 35-unit modes), turning one 35-dimensional computation into 34 3-dimensional computations and obtaining the emotion recognition of a character within a 35-character range.
Thus the M1 model can work independently, outputting an emotion sequence that is fed into the later fusion. This meets the requirement that expert users see the results of each angle's independent analysis in actual work, facilitating their final comprehensive analysis.
Step 2.2, semantic recognition model
Here the bidirectional LSTM scheme of the emotion recognition model is reused for semantic prediction. Since many quatrains (jueju) in ancient poetry have seven-character lines and the smallest unit used is 1 character, the first natural relationship length is 7 characters.
The method works as follows: taking a seven-character line ending in "酒" (wine) as an example, each of the seven characters is encoded with its GB18030-2022 character code; the code of the 1st character is the input value of the 1st LSTM, the code of the 2nd character the input of the 2nd LSTM, and so on, up to the code of the 7th character, "酒", as the input of the 7th LSTM.
From this example it is not hard to see that a missing character in the middle tends to combine with the following "酒" to form common two-character words such as "醉酒" ("drunk") or "尽酒", and likewise with the preceding "劝" ("urge") to form "劝酒" ("urging guests to drink"); hence the bidirectional LSTM scheme is reasonable, taking 3 characters before and after is reasonable, and the length is 7.
In addition, many phrases form antithesis; for example, in "两个黄鹂鸣翠柳，一行白鹭上青天" ("Two golden orioles sing in the green willows; a line of white egrets rises into the blue sky"), "两" is antithetical to "一" and "个" to "行". Their separation is 7 Chinese characters plus one punctuation mark, i.e. a distance of 8, so the 2nd natural relationship distance is 8 × 2 + 1 = 17.
Since setting the relationship length to either 7 or 17 is reasonable, the larger range of 17 can be selected as the maximum distance so that the relationship between paired characters is captured as fully as possible.
However, analyzing 17-dimensional data, as in step 2.1, is computationally heavy; to minimize computation, the character required by the antithesis can be supplied as an 8th input, i.e. an 8th dimension, on top of the 7-character window. In the example, if the missing value is "行", the inputs are its antithesis character plus the 3 characters before and after the gap. Whether a text uses five-character or seven-character lines can be supplied directly by the researcher, since the invention is an auxiliary algorithm provided to researchers.
Based on this analysis, and given the data set already constructed and the bidirectional LSTM scheme adopted in emotion recognition, a bidirectional LSTM with 8 LSTM units is chosen: it readily discovers detail that humans easily miss and offers better feature-extraction performance.
The specific scheme of this model, M2, is consistent with that of emotion recognition model M1 and is not repeated here; its output result is the semantic vector semantic.
Thus the M2 model can work independently, outputting a semantic sequence that is fed into the later fusion. This meets the requirement that expert users see the results of each angle's independent analysis in actual work, facilitating their final comprehensive analysis.
Step 2.3, phonetic recognition model
Step 2.3.1, pinyin recognition
Word2Vec word-vector encoding is adopted in combination with a Huffman-tree algorithm. As shown in FIG. 2, a standard CBOW (continuous bag-of-words) model comprises an Embedding layer, a hidden layer, and an output layer; the hidden-layer outputs are spliced, normalized, and passed through the output layer. Since roughly 3,500 first-level Chinese characters are in common use, the softmax output of a conventional fully connected layer would be computationally expensive, so the Huffman-tree encoding format is adopted to reduce the computation. Such combinations are well known and are not detailed here; the parameters are described at https://blog.csdn.net/qq_45198339/article/details/128772164.
After this step, the pinyin of the target Chinese character, without tone, is obtained (e.g. for the Chinese character "我" the pinyin "wo" is obtained, without its 3rd-tone information). The result is then input into the tone recognition of step 2.3.2 and combined with the recognized level-oblique tone to form toned Chinese pinyin.
This design is adopted because the existing scheme is widely recognized and its code and model are public, so in practice it provides good toneless pinyin.
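As an illustration, the CBOW-plus-Huffman-tree combination corresponds to hierarchical softmax in common Word2Vec implementations; a minimal gensim sketch under that assumption (the corpus and all sizes are placeholders):

```python
from gensim.models import Word2Vec

# Illustrative only: CBOW (sg=0) with a Huffman-tree output layer, i.e.
# hierarchical softmax (hs=1, negative=0), on a toy character-level corpus.
corpus = [list("锄禾日当午"), list("汗滴禾下土")]
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1,
                 sg=0, hs=1, negative=0)
print(model.wv["禾"][:5])   # low-dimensional character vector
```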
Step 2.3.2, tone (ping-ze) recognition
The phonological model mainly considers the antithesis of tones between upper and lower lines in classical verse, especially the level-oblique (ping-ze) pattern; ancient texts basically obey the following rules:
(1) Within each line, level and oblique tones alternate; within each couplet, the level-oblique patterns of the two lines are opposed, and the second line of one couplet matches the first line of the next couplet.
(2) Three consecutive level tones cannot occur at the end of a line.
(3) In a five-character line of the level-start, level-end type (second character level, last character level), and in a seven-character line of the oblique-start, level-end type (second character oblique, last character level), the first, third, or fifth character must supply at least one level tone; otherwise the line commits "solitary level" (guping).
Beyond the above there are further level-oblique rules. These rules can be handled either with a purely rule-based model or with a deep learning approach. A purely rule-based scheme can capture the rules in more detail, but the resulting model is clearly large and does not readily discover the detail that humans also find hard to notice. Given the data set already constructed, the bidirectional LSTM scheme adopted in emotion recognition, and the regular level tones at the line ends of lüshi (a poem of eight lines, each containing five or seven characters, with a strict tonal pattern and rhyme scheme), a bidirectional LSTM at the maximum 17-character distance, as in the semantic model, is adopted as phonetic recognition model M3.
In this way the invention recognizes tones over a window of 8 characters and then predicts the tone of the missing character with M3, using only the four tones as results. One of 5 result classes is used: 0 (unknown), 1 (first tone), 2 (second tone), 3 (third tone), 4 (fourth tone). Although the model works at the maximum 17-character distance like the semantic model, the input here has only 5 classes rather than the 3,500-plus characters of the semantic case, so the computation is markedly reduced. The specific scheme is shown in FIG. 3:
As with M1, take "两个黄鹂鸣翠柳，一行白鹭上青天" as an example. After character segmentation with tones, the poem reads: 两 (3rd tone), 个 (4th tone), 黄 (2nd tone), 鹂 (2nd tone), 鸣 (2nd tone), 翠 (4th tone), 柳 (3rd tone), ， (0, unknown), 一 (1st tone), 行 (2nd tone), 白 (2nd tone), 鹭 (4th tone), 上 (4th tone), 青 (1st tone), 天 (1st tone), 。 (0, unknown), giving the tone sequence 3422243012244110.
Assuming the missing character is "行", the sequence 3422243010244110 is input into the 35 LSTM units (structured as in model M1); the 35 LSTM units form Bi_LSTM(·), a bidirectional long short-term memory network, and the encoded sentence is then sent into the bidirectional network (Bi_LSTM) for feature extraction:
Temp=Bi_LSTM(Word_emb)
where word_emb=emb (Sen) represents the encoded sentence;
The extracted feature vector Temp is then sent into a Transformer network to extract global information, and finally the tone of the missing character is output:
tone=Transformer_Layer(Temp)
Step 2.3.3, combine the tone of the missing character with its pinyin to obtain the toned pinyin Pinyin of the missing character.
Thus the M3a and M3b models can work independently, outputting the pinyin and the level-oblique tone tone_pz, which can also be combined into complete toned pinyin and fed into the later fusion. This meets the requirement that expert users see the results of each angle's independent analysis in actual work, facilitating their final comprehensive analysis.
Step 2.4, synthesis output module
After the prediction results of the 3 dimensions are obtained, they must be synthesized into one output; the process comprises 2 parts, encoding-splicing and prediction, and is shown overall in FIG. 4.
Step 2.4.1, encoding and splicing.
The original sentence and the previously predicted emotion, semantics, and toned pinyin Pinyin are encoded separately; Emb(·) is the Embedding word-encoding operation (see the paper "Efficient Estimation of Word Representations in Vector Space"), a word-encoding mode that represents text with low-dimensional vectors optimized through neural-network training, so that the relevance between words can be expressed.
Word_emb1=Emb(semantic)
Word_emb2=Emb(emotion)
Word_emb3=Emb(Pinyin)
where semantic, emotion, and Pinyin respectively denote the results of semantic recognition, emotion recognition, and phonetic recognition.
Then, splicing the obtained coding vectors to obtain a coding tensor input:
input=Cat(Word_emb1,Word_emb2,Word_emb3)
where Cat(·) denotes the concatenation of the feature vectors.
The fused tensor input is sent into the Transformer for feature extraction, and the character missing from the ancient poem is finally predicted:
Output=Transformer(input)
The Transformer model was first proposed in the Google paper "Attention Is All You Need"; here its parameters are set as follows (an illustrative instantiation follows the list):
length of the input vector: 8;
number of hidden neurons in the feedforward neural network: 2048;
dimension of the query, key, and value vectors: 512;
number of stacked modules: 12;
number of attention heads in multi-head attention: 8.
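As an illustration only, the listed parameters map naturally onto a standard PyTorch encoder stack; the input length of 8 is a property of the token sequence, not of the module:

```python
import torch
import torch.nn as nn

# Illustrative only: an encoder stack with the parameters listed above
# (d_model=512 for the query/key/value dimension, 8 heads, FFN width 2048,
# 12 stacked modules).
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)
print(encoder(torch.randn(1, 8, 512)).shape)   # torch.Size([1, 8, 512])
```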
[1] Input representation (Input Embedding):
First, each word or token of the input sequence is converted into a vector representation, typically via word embedding. This encoding represents text with low-dimensional vectors optimized through neural-network training, so the relevance between words can be expressed. Embedding commonly takes two parameters: the first is the maximum dictionary size, and the second is the dimension of the desired output vector.
For example, the line "锄禾日当午" split by character is "锄", "禾", "日", "当", "午". Each character is fed into the embedding layer for encoding with a specified output dimension of 3; after the model is trained, a real-valued vector is obtained for each character, e.g. the vector encoding "锄" might be [0.2, 0.4, -0.1]. This yields a low-dimensional vector representation of the text, with the specific dimension set in advance.
Here only the Transformer encoder part is used for prediction; it is mainly a stack of identical attention modules (Transformer_Layer), and the processing of each module can be expressed as follows:
[2] Self-Attention calculation (Self-Attention):
At the heart of the Transformer encoder is the self-attention mechanism, which allows the model to build associations between different positions in the sequence. The input sequence undergoes three linear transformations to obtain the representations of the query, the key, and the value.
An attention weight is computed with the query vector to measure the relevance of each position in the input sequence to the query position; this can be achieved by computing the inner product of the query vector with all key vectors.
The attention weights are applied to the value vectors to obtain a weighted sum, i.e. a contextual representation related to the query position. The self-attention computation can be expressed as the following formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where Q, K, and V denote the query vector, the key vector, and the value vector respectively, d_k denotes the dimension of the key vector, and softmax(·) denotes the activation function.
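A minimal numeric sketch of this formula (the shapes follow the parameter list above and are illustrative):

```python
import math
import torch

def self_attention(Q, K, V):
    """Scaled dot-product attention, matching the formula above."""
    d_k = K.size(-1)
    weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    return weights @ V

Q = K = V = torch.randn(1, 8, 512)     # sequence length 8, dimension 512 as configured
print(self_attention(Q, K, V).shape)   # torch.Size([1, 8, 512])
```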
[3] Multi-Head Attention (Multi-Head Attention):
The Transformer model uses multiple independent self-attention mechanisms, called multi-head attention. Each attention head performs different query, key, and value linear transformations, allowing the model to capture different information in different representation subspaces. The flow of multi-head attention can be divided into the following steps:
dividing the input sequence data into a plurality of heads;
Performing independent query, key, value linear transformation for each header;
Performing a self-attention calculation on each head, resulting in an output of the head;
splice the outputs of all heads together and perform an output linear transformation.
The outputs of the multiple attention heads are concatenated and linearly transformed to produce the final attention representation.
[4] Layer normalization (Layer Normalization):
After the self-attention calculation and multi-head attention, a layer normalization operation is performed. This normalizes the attention representation, stabilizing training and the resulting features.
[5] Feed forward neural network (Feed-forward Neural Network):
In each of the attention modules, a feed-forward neural network is also included. It performs a nonlinear transformation on the attention representation of each location to enhance the representation capabilities of the model.
[6] Residual connections (Residual Connections) and layer normalization (Layer Normalization):
in each attention module, residual connection and layer normalization are used to enhance the flow and gradient propagation of information.
In the prediction result, the original scheme would fill in the character with the highest predicted probability as the missing character. However, because the semantics of text content are subjective, the character with the highest predicted probability is not necessarily the most suitable one. In the final selection, the characters with the highest predicted probabilities are therefore all output: the five predicted candidate characters, together with the output values of the M1, M2, M3a, and M3b models, are provided to the user, who can reason over them and manually select the most suitable character to fill the gap. This greatly improves the fluency of the text and its semantics.
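A minimal sketch of this top-k candidate output (the vocabulary size of 4,000 and the random scores are placeholders; k = 5 follows the five candidates mentioned above):

```python
import torch

probs = torch.softmax(torch.randn(4000), dim=0)   # placeholder score vector over candidates
top = torch.topk(probs, k=5)                      # the five most probable characters
for p, idx in zip(top.values, top.indices):
    print(f"candidate character id {idx.item()}  probability {p.item():.3f}")
```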
Regarding the data set: when each model is trained independently, the invention randomly hides characters present in the chinese-poetry dataset and uses the original emotion value, semantic value, pinyin value, and level-oblique tone of each hidden character as its learning outputs, forming paired training data.
The whole learning process is divided into 2 sub-steps. Sub-step 1 trains the sub-models M1, M2, and M3 independently on the data set to obtain M1L, M2L, and M3L; after sub-step 1, the 3 trained models are brought into the overall algorithm to train the later fusion part, yielding the overall model MTL, which is the content of sub-step 2.
In this embodiment, emotion recognition model M1 does not participate in the later learning: using the emotion recognition data set of Leng Yongcai's 2021 thesis at North China Electric Power University, "Research on short-text sentiment analysis algorithms based on deep learning", as input, a suitable emotion recognition model M1L can be trained, where the suffix L marks a model that has already been trained.
M2 and M3 (comprising M3a and M3b) remain to be learned. The invention randomly divides the whole data DataT of the ancient-text electronic dataset chinese-poetry into 3 parts: 30% for separate pre-learning, 50% for unified learning, and 20% for testing. The 30% portion trains M2 and M3 to obtain preliminary models; these are then substituted into step 2.4 of the invention for overall learning on the training data, giving the trained M2L and M3L (M3aL and M3bL) and finally the overall model MTL.
Step 3, use process
In this process, the 35 characters before and after the character to be recognized are input into M1L, M2L, and M3L. The emotion value is obtained through M1L, the toned pinyin through M3L, and the semantic vector through M2L; the overall MTL is then used to predict the specific character.
Step 4, feedback process
In this process, the invention collects the completion cases judged erroneous by philology experts during use and feeds them back into the learning process; by increasing the number of learning passes over these error cases, the learning accuracy is improved.