Disclosure of Invention
The invention aims to provide a Tibetan word segmentation and part-of-speech tagging integrated method and system that address the following problems: among traditional neural network models, an RNN is difficult to parallelize efficiently and a CNN cannot capture long-distance features; performing Tibetan part-of-speech tagging as a separate stage after Tibetan word segmentation leads to error accumulation; and processing Tibetan sticky-written words with a deep learning method alone is of low practicality.
The invention is realized in the following way:
In a first aspect, the application provides a Tibetan word segmentation and part-of-speech tagging integrated method, which comprises the following steps:
acquiring Tibetan text information to be segmented that is input by a user, and acquiring each Tibetan syllable of the Tibetan text information and its corresponding label;
calling an integrated model, segmenting the Tibetan syllables and the non-Tibetan character blocks by means of syllable points and Unicode encoding, and ordering them to obtain a Tibetan sequence, wherein the integrated model is a Tibetan word segmentation and part-of-speech tagging learning model built in advance based on Conformer;
calling the integrated model to perform CRF prediction on the Tibetan sequence and the corresponding labels to obtain the predicted label sequence corresponding to the Tibetan syllables in the Tibetan sequence;
and arranging the written form of each Tibetan syllable according to the predicted label sequence to obtain a corresponding labeling result.
Based on the first aspect, the building of the integrated model comprises the following steps:
reading a Tibetan training corpus input by a user, and obtaining each Tibetan syllable and its corresponding label;
constructing a Conformer-based Tibetan word segmentation and part-of-speech tagging integrated model framework;
and training to obtain and store the parameters of the Conformer-based Tibetan word segmentation and part-of-speech tagging integrated model, and generating the corresponding integrated model based on those parameters.
Based on the first aspect, constructing the Conformer-based Tibetan word segmentation and part-of-speech tagging integrated model framework includes:
constructing an Embedding layer based on the encoding part of a Transformer model, wherein the Embedding layer is used for acquiring the input Tibetan sequence to form the corresponding embedded vectors;
setting up a Decoding layer based on a CRF model, wherein the Decoding layer is used for computing the probability P of the label corresponding to each Tibetan syllable in the text, and calculating the optimal label category according to the constraint relations among the sequence labels;
and establishing the Tibetan word segmentation and part-of-speech tagging integrated model framework based on the Embedding layer and the Decoding layer.
Based on the first aspect, after the Tibetan training corpus input by the user is read and each Tibetan syllable and the corresponding label are obtained, the method further includes:
acquiring a data set of the Tibetan training corpus, and extracting features of each Tibetan syllable and its adjacent syllables in the data set to obtain fused syllable vectors representing feature words;
and establishing a corresponding word vector matrix based on the feature words, wherein the word vector matrix is used for standardizing the corresponding integrated model.
Based on the first aspect, calling the integrated model to perform CRF prediction on the Tibetan sequence and the corresponding labels to obtain the predicted label sequence corresponding to the Tibetan syllables in the Tibetan sequence includes:
obtaining a corresponding Tibetan sequence (x1, x2, ..., xn) based on the Tibetan syllables, wherein xi represents a syllable in the Tibetan word segmentation task and a Tibetan word in the Tibetan part-of-speech tagging task;
inputting the Tibetan sequence to the Embedding layer to generate an embedded vector;
inputting the embedded vector into a CNN network to fully learn the local features of the syllables or words, converting the syllables or words into local semantic vectors, and splicing the local semantic vectors with the current syllable or word vectors;
and inputting the spliced vector to a CRF layer to perform classification prediction and decoding of the labels, predicting, for the given Tibetan sequence (x1, x2, ..., xn), the state sequence (y1, y2, ..., yn) corresponding to the syllables, and outputting the optimal label category, wherein yi represents the label corresponding to each Tibetan syllable or Tibetan word.
Based on the first aspect, if the Tibetan text information includes a sticky-written word, the method further includes:
dividing an input sentence by using the syllable point "·" as the separator to obtain corresponding segmented syllables, wherein a segmented syllable is the part between two adjacent syllable points;
traversing the segmented syllables from back to front, and determining the last two characters of each segmented syllable;
if the last two characters of a segmented syllable are [Tibetan characters missing], marking the segmented syllable as a quasi-sticky-written word and matching the quasi-sticky-written word against a first dictionary, and if the quasi-sticky-written word is not in the first dictionary, determining that the segmented syllable is not a sticky-written word;
if the last character of a segmented syllable is [Tibetan characters missing], marking the segmented syllable as a pseudo-sticky-written word, splitting the pseudo-sticky-written word at the function-word part, and matching the first half against a second dictionary; if the first half is in the second dictionary, supplementing the first half with [Tibetan characters missing] and converting the second half; and if the first half is not in the second dictionary, leaving the first half unchanged and converting the second half;
matching the pseudo-sticky-written word against a third dictionary; if it is not in the third dictionary, determining that it is not a sticky-written word; if it is in the third dictionary, matching it against a fourth dictionary; if it is in the fourth dictionary, returning to the step of splitting the pseudo-sticky-written word at the function-word part and matching the first half against the second dictionary; if it is not in the fourth dictionary, combining it with the previous syllable and matching the combination against a fifth dictionary; if the combination is in the fifth dictionary, returning to the step of splitting the pseudo-sticky-written word at the function-word part and matching the first half against the second dictionary; and if the combination is not in the fifth dictionary, determining that the pseudo-sticky-written word is not a sticky-written word;
wherein the first dictionary contains all sticky-written words of the four function words [Tibetan characters missing], 4420 entries in total; the second dictionary contains 48 Tibetan structures and is used for judging whether a sticky-written word needs to be converted; the third dictionary contains all pseudo-sticky-written words of the two function words [Tibetan characters missing], 2201 entries in total; the fourth dictionary contains 4 sticky-written words of the two function words [Tibetan characters missing]; and the fifth dictionary contains 9 non-sticky-written words formed by combining the two function words [Tibetan characters missing] with the previous syllable.
In a second aspect, the present application further provides a Tibetan word segmentation and part-of-speech tagging integrated system, where the system includes:
an acquisition module, used for acquiring Tibetan text information to be segmented that is input by a user, and acquiring each Tibetan syllable of the Tibetan text information and its corresponding label;
a calling module, used for calling an integrated model, segmenting the Tibetan syllables and non-Tibetan character blocks by means of syllable points and Unicode encoding, and ordering them to obtain a Tibetan sequence, wherein the integrated model is a Conformer-based Tibetan word segmentation and part-of-speech tagging integrated model;
a prediction module, used for calling the integrated model to perform CRF prediction on the Tibetan sequence and the corresponding labels to obtain the predicted label sequence corresponding to the Tibetan syllables in the Tibetan sequence;
and a labeling module, used for arranging the written form of each Tibetan syllable according to the predicted label sequence to obtain a corresponding labeling result.
In a third aspect, the present application provides an electronic device comprising a processor and a memory for storing one or more programs, wherein the one or more programs, when executed by the processor, implement the method described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described in any implementation of the first aspect.
Compared with the prior art, the invention has at least the following advantages or beneficial effects:
The application provides a Tibetan word segmentation and part-of-speech tagging integrated method, which comprises: acquiring Tibetan text information to be segmented that is input by a user, and acquiring each Tibetan syllable of the Tibetan text information and its corresponding label; calling an integrated model and segmenting the Tibetan syllables and non-Tibetan character blocks by means of syllable points and Unicode encoding; calling the integrated model to perform CRF prediction, so as to obtain the relations among the labels corresponding to the Tibetan sequence of the Tibetan text and to perform globally optimal label prediction on the Tibetan syllables; and arranging the written form of each Tibetan syllable according to the label prediction result to obtain a corresponding labeling result. Compared with the prior art, because the integrated model performs word segmentation and part-of-speech tagging jointly, the method avoids the error accumulation that arises when Tibetan word segmentation and part-of-speech tagging are carried out in two separate stages, so that Tibetan word segmentation and part-of-speech tagging can be processed more accurately and the practicality of the scheme is improved.
The invention also provides a Tibetan word segmentation and part of speech tagging integrated system, electronic equipment and a computer readable storage medium, which correspond to the Tibetan word segmentation and part of speech tagging integrated method, so that the Tibetan word segmentation and part of speech tagging integrated system has the same beneficial effects.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The various embodiments and features of the embodiments described below may be combined with one another without conflict.
Examples
Referring to Fig. 1, Fig. 1 is a flowchart of a Tibetan word segmentation and part-of-speech tagging integrated method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
S10, acquiring Tibetan text information to be segmented that is input by a user, and acquiring each Tibetan syllable of the Tibetan text information and its corresponding label;
It should be noted that this embodiment does not limit the specific content of the Tibetan text information. It can be understood that, for Tibetan, part-of-speech tagging is performed after word segmentation. This embodiment also does not limit how Tibetan syllables are matched to their labels. The Tibetan word segmentation labels use the four BMES tags, and the Tibetan part-of-speech labels use the tag codes given in the Tibetan part-of-speech tag set for information processing. In a label, the part before the "-" indicates the position of the syllable within the word, and the part after the "-" indicates the part of speech. For example, for the sentence [Tibetan text missing], the labels of its syllables are B-nq, E-nq and S-ww respectively.
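The joint label format described above can be illustrated with a minimal sketch. The names `JointTag` and `parse_joint_tag` are hypothetical and are not part of the application; the sketch only assumes that a label is a BMES position tag, a "-", and a part-of-speech code, as in the B-nq, E-nq and S-ww example.

```python
from typing import NamedTuple


class JointTag(NamedTuple):
    position: str  # one of B, M, E, S: position of the syllable within the word
    pos: str       # part-of-speech code taken from the tag set, e.g. "nq"


def parse_joint_tag(tag: str) -> JointTag:
    """Split a joint label such as "B-nq" into its two parts."""
    position, sep, pos = tag.partition("-")
    if position not in {"B", "M", "E", "S"} or not sep or not pos:
        raise ValueError(f"malformed joint tag: {tag!r}")
    return JointTag(position, pos)


# The three labels from the example in the text.
tags = [parse_joint_tag(t) for t in ["B-nq", "E-nq", "S-ww"]]
```

Here the first two syllables share the part of speech nq and together form one word, while the third syllable is a single-syllable word with part of speech ww.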
S11, calling an integrated model, segmenting the Tibetan syllables and non-Tibetan character blocks by means of syllable points and Unicode encoding, and ordering them to obtain a Tibetan sequence;
It can be understood that the integrated model is a Tibetan word segmentation and part-of-speech tagging learning model built in advance based on Conformer. This embodiment does not limit the specific way in which the integrated model is built, nor the specific way in which Tibetan syllables and non-Tibetan character blocks are segmented.
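Since the segmentation mode is left open, the following is only one possible sketch of step S11. It assumes that the syllable point is the Tibetan tsheg (U+0F0B) and that a character counts as Tibetan when it lies in the Unicode Tibetan block (U+0F00 to U+0FFF); the function names are hypothetical. Punctuation inside the Tibetan block is not treated specially here.

```python
from typing import List

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark, assumed to be the syllable point


def is_tibetan(ch: str) -> bool:
    """True if the character lies in the Unicode Tibetan block."""
    return "\u0f00" <= ch <= "\u0fff"


def split_units(text: str) -> List[str]:
    """Split text into Tibetan syllables and non-Tibetan character blocks, in order."""
    units: List[str] = []
    buf = ""
    for ch in text:
        if ch == TSHEG or ch.isspace():
            # A syllable point or whitespace closes the current unit.
            if buf:
                units.append(buf)
            buf = ""
        elif buf and is_tibetan(ch) != is_tibetan(buf[-1]):
            # Switching between Tibetan and non-Tibetan script starts a new unit.
            units.append(buf)
            buf = ch
        else:
            buf += ch
    if buf:
        units.append(buf)
    return units


units = split_units("\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42\u0f0bABC123")
```

The input above consists of two Tibetan syllables followed by a Latin and digit block, so `units` holds three items in their original order.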
S12, calling the integrated model to perform CRF prediction on the Tibetan sequence and the corresponding labels to obtain the predicted label sequence corresponding to the Tibetan syllables in the Tibetan sequence;
The Tibetan sequence is a sequence of Tibetan syllables arranged in order. This embodiment does not specifically limit the CRF prediction or the related optimal label prediction. It can be understood that label prediction in this embodiment labels the content of specific Tibetan syllables, that is, the corresponding part-of-speech tagging is carried out through the labels.
S13, arranging the written form of each Tibetan syllable according to the predicted label sequence to obtain a corresponding labeling result.
Table 1 compares the accuracy of the integrated Tibetan word segmentation results, and Table 2 compares the accuracy of the integrated Tibetan part-of-speech tagging results. As shown in Tables 1 and 2, analysis of the experimental results indicates that both integrated Tibetan word segmentation and integrated Tibetan part-of-speech tagging perform better than carrying out Tibetan word segmentation and part-of-speech tagging independently. The integrated model can more fully consider the dependency between word segmentation labels and part-of-speech tagging information and combines Tibetan word segmentation and part-of-speech tagging, so that the performance of both is improved as a whole.
Table 1. Accuracy comparison of integrated Tibetan word segmentation results:
Table 2. Accuracy comparison of integrated Tibetan part-of-speech tagging results:
According to the Tibetan word segmentation and part-of-speech tagging integrated method, Tibetan text information to be segmented that is input by a user is acquired, and each Tibetan syllable of the Tibetan text information and its corresponding label are acquired; an integrated model is called, and the Tibetan syllables and non-Tibetan character blocks are segmented by means of syllable points and Unicode encoding; the integrated model is called to perform CRF prediction, so as to obtain the relations among the labels corresponding to the Tibetan sequence of the Tibetan text and to perform globally optimal label prediction on the Tibetan syllables, the Tibetan sequence being a sequence of Tibetan syllables arranged in order; and the written form of each Tibetan syllable is arranged according to the label prediction result to obtain a corresponding labeling result. Compared with the prior art, because the integrated model performs word segmentation and part-of-speech tagging jointly, the method avoids the error accumulation that arises when Tibetan word segmentation and part-of-speech tagging are carried out in two separate stages, so that Tibetan word segmentation and part-of-speech tagging can be processed more accurately and the practicality of the scheme is improved.
The above embodiment does not limit how the integrated model is built. In some embodiments of the present application, building the integrated model includes the following steps:
reading a Tibetan training corpus input by a user, and obtaining each Tibetan syllable and its corresponding label;
constructing a Conformer-based Tibetan word segmentation and part-of-speech tagging integrated model framework;
and training to obtain and store the parameters of the Conformer-based Tibetan word segmentation and part-of-speech tagging integrated model, and generating the corresponding integrated model based on those parameters.
This embodiment provides a specific method for building the integrated model: the constructed framework is trained on the training corpus input in advance. Fig. 2 is a display diagram of the training data provided in an embodiment of the present application. Training on this data finally generates the corresponding integrated model, which further improves the completeness of the scheme.
The above embodiments do not specifically describe how the corresponding integrated model framework is built. In some embodiments of the present application, constructing the Conformer-based Tibetan word segmentation and part-of-speech tagging integrated model framework includes:
constructing an Embedding layer based on the encoding part of a Transformer model, wherein the Embedding layer is used for acquiring the input Tibetan sequence to form the corresponding embedded vectors;
setting up a Decoding layer based on a CRF model, wherein the Decoding layer is used for computing the probability P of the label corresponding to each Tibetan syllable in the text, and calculating the optimal label category according to the constraint relations among the sequence labels;
and establishing the Tibetan word segmentation and part-of-speech tagging integrated model framework based on the Embedding layer and the Decoding layer.
For ease of understanding, Fig. 3 is a block diagram of the integrated model framework provided by the embodiment of the present application. As shown in Fig. 3 and described above, a CNN + Transformer + CRF model framework is mainly adopted. It should be noted that the Embedding layer turns the input Tibetan sequence into the corresponding embedded vectors, the encoder part of the Transformer model serves as the Encoding layer, and the CRF model is used for Decoding. Each input sequence is first embedded and then fed into the Transformer, whose encoding layer comprises a vector embedding layer and a position embedding layer; its self-attention mechanism lets the model attend to semantic information at different levels, which helps establish contextual semantic relations. The result is then fed into a conditional random field: the CRF layer computes the probability P of the label corresponding to each syllable in the text and calculates the optimal label category according to the constraint relations among the sequence labels. A corresponding integrated model can be built through this scheme. On top of using the Transformer to learn sentence context information, a CNN network structure is introduced so that the model can learn local feature information in the text more fully.
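How a CRF layer can pick the optimal labels under the constraint relations among sequence labels may be sketched with Viterbi decoding. This is a generic illustration rather than the decoder of the application: the scores are made-up numbers, only the BMES position tags are used instead of the full joint labels, and transitions that do not appear in the table are assumed to be forbidden.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label path.

    emissions[t][y] is the score of label y at position t;
    transitions[(a, b)] is the score of moving from a to b (missing pairs are forbidden).
    """
    if not emissions:
        return []
    score = {y: emissions[0].get(y, NEG_INF) for y in labels}
    backpointers: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Best previous label for reaching y at position t.
            prev = max(labels, key=lambda p: score[p] + transitions.get((p, y), NEG_INF))
            new_score[y] = (score[prev] + transitions.get((prev, y), NEG_INF)
                            + emissions[t].get(y, NEG_INF))
            pointers[y] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: score[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["B", "M", "E", "S"]
# Allowed BMES transitions, all with score 0 (illustrative values).
allowed = [("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")]
transitions = {pair: 0.0 for pair in allowed}
emissions = [
    {"B": 1.0, "M": 0.0, "E": 0.0, "S": 0.8},
    {"B": 0.0, "M": 0.0, "E": 0.9, "S": 1.0},
    {"B": 0.2, "M": 0.0, "E": 0.0, "S": 1.0},
]
best = viterbi(emissions, transitions, labels)
```

Choosing the top label at each position independently would give B, S, S, which breaks the constraint that B must be followed by M or E. The decoder instead returns B, E, S, the best path that respects the constraints.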
To standardize the model for subsequent use, in some embodiments of the present application, after reading the Tibetan training corpus input by the user and obtaining each Tibetan syllable and its corresponding label, the method further includes:
acquiring a data set of the Tibetan training corpus, and extracting features of each Tibetan syllable and its adjacent syllables in the data set to obtain fused syllable vectors representing feature words;
and establishing a corresponding word vector matrix based on the feature words, wherein the word vector matrix is used for standardizing the corresponding integrated model.
This embodiment does not specifically limit how the data set and the corresponding word vector matrix are established. It should be noted that the above scheme extracts fused feature words, which further optimizes the model and improves the accuracy of the method.
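Because the fusion method is not fixed by the embodiment, the following sketch shows just one simple possibility: each syllable vector is concatenated with the vectors of its left and right neighbours, with zero padding at the sentence boundaries. The lookup table, the window of one neighbour on each side and the function name are all assumptions made for illustration.

```python
from typing import Dict, List


def fuse_with_neighbours(syllables: List[str],
                         table: Dict[str, List[float]],
                         dim: int) -> List[List[float]]:
    """Build one fused row per syllable: [previous | current | next]."""
    pad = [0.0] * dim  # used outside the sentence and for unknown syllables

    def vec(i: int) -> List[float]:
        if 0 <= i < len(syllables):
            return table.get(syllables[i], pad)
        return pad

    return [vec(i - 1) + vec(i) + vec(i + 1) for i in range(len(syllables))]


# Toy vectors with placeholder syllable names.
table = {"s1": [1.0, 2.0], "s2": [3.0, 4.0]}
matrix = fuse_with_neighbours(["s1", "s2"], table, dim=2)
```

Each row of `matrix` has three times the original dimension, so the matrix as a whole plays the role of a vector matrix over the fused features.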
The above embodiments do not limit the specific manner of CRF prediction used to obtain the corresponding labels. In some embodiments of the present application, calling the integrated model to perform CRF prediction on the Tibetan sequence and the corresponding labels to obtain the predicted label sequence corresponding to the Tibetan syllables in the Tibetan sequence includes:
obtaining a corresponding Tibetan sequence (x1, x2, ..., xn) based on the Tibetan syllables, wherein xi represents a syllable in the Tibetan word segmentation task and a Tibetan word in the Tibetan part-of-speech tagging task;
inputting the Tibetan sequence to the Embedding layer to generate an embedded vector;
inputting the embedded vector into a CNN network to fully learn the local features of the syllables or words, converting the syllables or words into local semantic vectors, and splicing the local semantic vectors with the current syllable or word vectors;
and inputting the spliced vectors to a CRF layer to perform classification prediction and decoding of the labels, predicting, for the given Tibetan sequence (x1, x2, ..., xn), the state sequence (y1, y2, ..., yn) corresponding to the syllables, and outputting the optimal label category, wherein yi represents the label corresponding to each Tibetan syllable or Tibetan word.
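The CNN and splicing steps above can be sketched in plain Python. The sketch assumes a one-dimensional convolution over the embedded sequence with zero padding and a ReLU activation, followed by concatenating each local feature vector with the embedding of the same position; the kernel values and function names are illustrative only and are not taken from the application.

```python
from typing import List

Vector = List[float]


def conv1d_same(embeddings: List[Vector], kernels: List[List[Vector]]) -> List[Vector]:
    """Apply each kernel (width k, odd) over the sequence; output has one value per kernel."""
    n = len(embeddings)
    out: List[Vector] = []
    for t in range(n):
        row: Vector = []
        for kernel in kernels:
            half = len(kernel) // 2
            total = 0.0
            for j, weights in enumerate(kernel):
                idx = t + j - half
                if 0 <= idx < n:  # positions outside the sequence count as zeros
                    total += sum(w * x for w, x in zip(weights, embeddings[idx]))
            row.append(max(0.0, total))  # ReLU
        out.append(row)
    return out


def splice(embeddings: List[Vector], local: List[Vector]) -> List[Vector]:
    """Concatenate each embedding with the local feature vector of the same position."""
    return [e + l for e, l in zip(embeddings, local)]


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# One kernel of width 3 that sums the first component over the window.
kernels = [[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]]
local = conv1d_same(embeddings, kernels)
spliced = splice(embeddings, local)
```

Each spliced vector then carries both the syllable's own embedding and a summary of its immediate neighbourhood, which is the kind of input the CRF layer would score.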
For ease of understanding, Fig. 4 is a flowchart of the integrated word segmentation and part-of-speech tagging task provided by the embodiment of the present application. As described in the foregoing embodiment and as shown in Fig. 4, the integrated model outputs the optimal labels for the corresponding sequence. In the example of Fig. 4, the model merges the word segmentation labels and the part-of-speech labels, so that each label consists of a word segmentation tag together with a part-of-speech tag. The integrated result displays only the part-of-speech tag, which is attached with "/" after each syllable whose word segmentation tag is E or S. For example, for the sentence [Tibetan text missing] ("Functional scaffold"), the input sequence is [Tibetan text missing], the tag sequence is [tag sequence missing], and the word segmentation result is [Tibetan text missing]. This embodiment thus further specifies the CRF prediction and how the labels corresponding to the Tibetan sequence of the Tibetan text are obtained, which improves the completeness of the scheme.
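The output convention described above, with the part-of-speech tag attached by "/" after each syllable tagged E or S, might be rendered as in the sketch below. The function name, the joiner placed between syllables of one word and the space between words are assumptions; placeholder syllable names stand in for Tibetan text.

```python
from typing import List


def render(syllables: List[str], tags: List[str], joiner: str = "\u0f0b") -> str:
    """Group syllables into words and append "/pos" after each E or S syllable."""
    words: List[str] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        position, _, pos = tag.partition("-")
        current.append(syllable)
        if position in ("E", "S"):
            # The word ends here, so only now is the part of speech shown.
            words.append(joiner.join(current) + "/" + pos)
            current = []
    if current:
        # Unfinished word at the end of the input (no closing E tag).
        words.append(joiner.join(current))
    return " ".join(words)


result = render(["s1", "s2", "s3"], ["B-nq", "E-nq", "S-ww"], joiner="-")
```

With the labels from the earlier example, the first two syllables are merged into one word tagged nq and the third becomes a separate word tagged ww.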
Tibetan also contains sticky-written words. Fig. 5 is a flowchart of a method for processing sticky-written words according to an embodiment of the present application. In some embodiments of the present application, if the Tibetan text information includes a sticky-written word, the method further includes:
dividing the input sentence by using the syllable point "·" as the separator to obtain corresponding segmented syllables, wherein a segmented syllable is the part between two adjacent syllable points;
traversing the segmented syllables from back to front, and determining the last two characters of each segmented syllable;
if the last two characters of a segmented syllable are [Tibetan characters missing], marking the segmented syllable as a quasi-sticky-written word and matching the quasi-sticky-written word against a first dictionary, and if the quasi-sticky-written word is not in the first dictionary, determining that the segmented syllable is not a sticky-written word;
if the last character of a segmented syllable is [Tibetan characters missing], marking the segmented syllable as a pseudo-sticky-written word, splitting the pseudo-sticky-written word at the function-word part, and matching the first half against a second dictionary; if the first half is in the second dictionary, supplementing the first half with [Tibetan characters missing] and converting the second half; and if the first half is not in the second dictionary, leaving the first half unchanged and converting the second half;
matching the pseudo-sticky-written word against a third dictionary; if it is not in the third dictionary, determining that it is not a sticky-written word; if it is in the third dictionary, matching it against a fourth dictionary; if it is in the fourth dictionary, returning to the step of splitting the pseudo-sticky-written word at the function-word part and matching the first half against the second dictionary; if it is not in the fourth dictionary, combining it with the previous syllable and matching the combination against a fifth dictionary; if the combination is in the fifth dictionary, returning to the step of splitting the pseudo-sticky-written word at the function-word part and matching the first half against the second dictionary; and if the combination is not in the fifth dictionary, determining that the pseudo-sticky-written word is not a sticky-written word;
wherein the first dictionary contains all sticky-written words of the four function words [Tibetan characters missing], 4420 entries in total; the second dictionary contains 48 Tibetan structures and is used for judging whether a sticky-written word needs to be converted; the third dictionary contains all pseudo-sticky-written words of the two function words [Tibetan characters missing], 2201 entries in total; the fourth dictionary contains 4 sticky-written words of the two function words [Tibetan characters missing]; and the fifth dictionary contains 9 non-sticky-written words formed by combining the two function words [Tibetan characters missing] with the previous syllable.
In this embodiment, matching against the corresponding dictionaries provides a scheme for processing sticky-written words, which broadens the range of text the model can handle and improves the completeness and practicality of the scheme.
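The order of the third, fourth and fifth dictionary checks can be sketched as a small decision function. This is an illustration under assumptions, not the implementation of the application: the dictionaries are passed in as plain sets with placeholder entries, the candidate is joined to the previous syllable with a configurable joiner, and the branch for a candidate that is absent from the fourth dictionary is read as "combine with the previous syllable", which is one interpretation of the text.

```python
from typing import Set


def resolve_candidate(candidate: str, previous: str,
                      third: Set[str], fourth: Set[str], fifth: Set[str],
                      joiner: str = "\u0f0b") -> str:
    """Return "split" if the candidate should go to the splitting step, else "not_sticky"."""
    if candidate not in third:
        return "not_sticky"
    if candidate in fourth:
        return "split"
    # Assumed reading: not in the fourth dictionary, so try the combined form.
    combined = previous + joiner + candidate
    if combined in fifth:
        return "split"
    return "not_sticky"


# Placeholder entries standing in for the real dictionary contents.
third = {"ca", "cb", "cc"}
fourth = {"ca"}
fifth = {"pre+cb"}
outcomes = [resolve_candidate(c, "pre", third, fourth, fifth, joiner="+")
            for c in ["zz", "ca", "cb", "cc"]]
```

A result of "split" corresponds to returning to the step that splits the word at the function-word part and matches the first half against the second dictionary.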
Based on the same inventive concept, the application also provides a system. Fig. 6 is a structural diagram of a Tibetan word segmentation and part-of-speech tagging integrated system provided by the embodiment of the application. As shown in Fig. 6, the system includes:
an acquisition module 1, used for acquiring Tibetan text information to be segmented that is input by a user, and acquiring each Tibetan syllable of the Tibetan text information and its corresponding label;
a calling module 2, used for calling an integrated model, segmenting the Tibetan syllables and non-Tibetan character blocks by means of syllable points and Unicode encoding, and ordering them to obtain a Tibetan sequence, wherein the integrated model is a Conformer-based Tibetan word segmentation and part-of-speech tagging integrated model;
a prediction module 3, used for calling the integrated model to perform CRF prediction on the Tibetan sequence and the corresponding labels to obtain the predicted label sequence corresponding to the Tibetan syllables in the Tibetan sequence;
and a labeling module 4, used for arranging the written form of each Tibetan syllable according to the predicted label sequence to obtain a corresponding labeling result.
The specific embodiments of the above system and the corresponding beneficial effects are described in the method section, and are not described in detail herein.
Referring to fig. 7, fig. 7 is a block diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a memory 101, a processor 102 and a communication interface 103, wherein the memory 101, the processor 102 and the communication interface 103 are electrically connected with each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, such as program instructions/modules corresponding to a Tibetan word segmentation and part-of-speech tagging integrated system provided in the embodiments of the present application, and the processor 102 executes the software programs and modules stored in the memory 101, thereby executing various functional applications and data processing. The communication interface 103 may be used for communication of signaling or data with other node devices.
The memory 101 may be, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), etc.
The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It will be appreciated that the configuration shown in fig. 7 is merely illustrative, and that the electronic device may also include more or fewer components than shown in fig. 7, or have a different configuration than shown in fig. 7. The components shown in fig. 7 may be implemented in hardware, software, or a combination thereof.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps as described in the method embodiments above.
It will be appreciated that the methods of the above embodiments, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, the software product being used to perform all or part of the steps of the methods of the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
In summary, the specific embodiments and the corresponding beneficial effects of the electronic device and the computer-readable storage medium provided in the embodiments of the present application are described in the above method section, and are not described herein.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.