
Text word segmentation method, device, equipment and storage medium

Info

Publication number
CN114218939A
Authority
CN
China
Prior art keywords
label
word segmentation
sequence
text
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111530194.2A
Other languages
Chinese (zh)
Other versions
CN114218939B (en)
Inventor
胡羽蓝
李佳轩
陈洪亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111530194.2A
Publication of CN114218939A
Application granted
Publication of CN114218939B
Legal status: Active (current)
Anticipated expiration

Abstract

Translated from Chinese

The present disclosure relates to a text word segmentation method, apparatus, device, and storage medium. In the word segmentation decoding stage, the method derives the confusion degree (perplexity) of each label sequence from the probability that each participle in the corresponding participle sequence appears given that its preceding participle appears; this confusion degree evaluates how reasonable the label sequence is. Combined with the weight of each label sequence learned in the word segmentation encoding stage, the optimal label sequence is selected, ensuring robust and accurate segmentation in low-resource scenarios. Compared with the related art, the technical solution of the present application completes word segmentation without a large amount of computing resources, making it suitable for industrial scenarios, and it does not need to stack models of multiple tasks to promote the performance of the word segmentation model; instead, these technical effects are achieved by improving the encoding algorithm.

Description

Text word segmentation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of natural language processing, and in particular, to a text word segmentation method, apparatus, device, and storage medium.
Background
Text word segmentation is a basic task in the field of natural language processing. As the foundation of other natural language processing tasks, a word segmentation model based on a neural network needs to maintain good robustness in any scenario. However, in low-resource scenarios that lack training data, the word segmentation model usually cannot predict unknown words well, resulting in inaccurate word segmentation results.
For low-resource scenarios, the related art stacks models of multiple tasks to make up for the deficiencies of a single word segmentation model. Such a multitask model essentially improves the robustness of the word segmentation model and the accuracy of its results by learning knowledge common to different standard data sets as well as knowledge unique to a single data set, and aggregating the two. However, a strong feature extraction layer is required to balance these two kinds of information, the training requirements are high, and the resulting model is difficult to bring up to the expected target.
Disclosure of Invention
The disclosure provides a text word segmentation method, apparatus, device, and storage medium, which at least solve the problem in the related art that, for low-resource scenarios lacking training data, a word segmentation model generally cannot predict unknown words well, so that the word segmentation result is inaccurate. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a text word segmentation method, including: acquiring a plurality of label sequences corresponding to a text to be participled, where the label sequences are used for segmenting the text to be participled into corresponding participle sequences; coding the text to be participled to obtain the weight corresponding to each label sequence; and decoding to obtain a target label sequence according to the weight and the confusion degree corresponding to each label sequence, so as to determine a word segmentation result by using the target label sequence; where the confusion degree corresponding to each label sequence is determined according to the statistical probability that each participle in the participle sequence corresponding to the label sequence appears given that its preceding participle appears.
With reference to the first aspect, in a possible implementation manner of the first aspect, the decoding to obtain the target label sequence according to the weight and the confusion degree corresponding to each label sequence includes: for the participle sequence corresponding to each label sequence, determining the statistical probability that the first participle in the participle sequence appears and the conditional statistical probability of each remaining participle, where the conditional statistical probability of the k-th participle refers to the statistical probability that the k-th participle appears given that the (k-1)-th participle appears, and k is a positive integer greater than 1; determining the confusion degree of the label sequence according to the statistical probability of the first participle and the conditional statistical probability of each remaining participle; and determining the target label sequence according to the weight and the confusion degree corresponding to each label sequence.
With reference to the first aspect, in a possible implementation manner of the first aspect, the determining the statistical probability that the first participle in the participle sequence appears and the conditional statistical probability of each remaining participle includes: determining the probability that the first participle appears in a preset corpus set as the statistical probability of the first participle; determining the probability that each pair of adjacent participles appears consecutively in the preset corpus set, and the probability that each participle appears in the preset corpus set; and determining the conditional statistical probability of the k-th participle according to the probability that the k-th participle and the (k-1)-th participle appear consecutively and the probability that the (k-1)-th participle appears.
With reference to the first aspect, in a possible implementation manner of the first aspect, the preset corpus is constructed according to a standard corpus and an associated corpus of the standard corpus in a target field, where the associated corpus of the standard corpus refers to a corpus labeled based on a same character label labeling rule as the standard corpus, the associated corpus belongs to a non-target field, the character label labeling rule refers to a rule for labeling a character label for a text in the corpus, and the character label is used to label a position of each character in the text in a participle of the text.
With reference to the first aspect, in a possible implementation manner of the first aspect, each of the standard corpus and the associated corpus includes a plurality of texts, each of the texts includes a plurality of participles, and characters in the participles are labeled with character tags; the number of the same participles in the standard corpus set and the associated corpus set is larger than a preset number, and the labeled character labels of the characters in the same participle in the standard corpus set and the labeled character labels in the associated corpus set are the same.
With reference to the first aspect, in a possible implementation manner of the first aspect, the label sequence includes a character label corresponding to each character in the text to be participled, where the character label is used to mark the position of the character within a participle; the weight corresponding to the label sequence includes an emission weight and a state transition weight corresponding to each character label in the label sequence.
With reference to the first aspect, in a possible implementation manner of the first aspect, the decoding to obtain the target label sequence according to the weight and the confusion degree corresponding to each label sequence includes: determining a first weight sum and a second weight sum corresponding to each label sequence, where the first weight sum is the sum of the emission weights corresponding to the character labels in the label sequence, and the second weight sum is the sum of the state transition weights corresponding to the character labels in the label sequence; determining the score of each label sequence according to its first weight sum, second weight sum, and confusion degree; and determining the label sequence corresponding to the maximum score as the target label sequence.
With reference to the first aspect, in a possible implementation manner of the first aspect, the encoding the text to be word-segmented to obtain a weight corresponding to each tag sequence includes: and inputting the text to be participled into a coding model trained by utilizing the preset corpus, and outputting the emission weight and the state transition weight corresponding to each character tag in each tag sequence.
According to a second aspect of the embodiments of the present disclosure, there is provided a text segmentation apparatus, including: a label sequence acquisition unit, configured to acquire a plurality of label sequences corresponding to a text to be participled, where the label sequences are used for segmenting the text to be participled into corresponding participle sequences; a coding unit, configured to code the text to be participled to obtain the weight corresponding to each label sequence; and a decoding unit, configured to decode to obtain a target label sequence according to the weight and the confusion degree corresponding to each label sequence, so as to determine a word segmentation result by using the target label sequence; where the confusion degree corresponding to each label sequence is determined according to the statistical probability that each participle in the participle sequence corresponding to the label sequence appears given that its preceding participle appears.
With reference to the second aspect, in a possible implementation manner of the second aspect, the decoding unit is specifically configured to: for the participle sequence corresponding to each label sequence, determine the statistical probability that the first participle in the participle sequence appears and the conditional statistical probability of each remaining participle, where the conditional statistical probability of the k-th participle refers to the statistical probability that the k-th participle appears given that the (k-1)-th participle appears, and k is a positive integer greater than 1; determine the confusion degree of the label sequence according to the statistical probability of the first participle and the conditional statistical probability of each remaining participle; and determine the target label sequence according to the weight and the confusion degree corresponding to each label sequence.
With reference to the second aspect, in a possible implementation manner of the second aspect, the decoding unit is specifically configured to: determine the probability that the first participle appears in a preset corpus set as the statistical probability of the first participle; determine the probability that each pair of adjacent participles appears consecutively in the preset corpus set, and the probability that each participle appears in the preset corpus set; and determine the conditional statistical probability of the k-th participle according to the probability that the k-th participle and the (k-1)-th participle appear consecutively and the probability that the (k-1)-th participle appears.
With reference to the second aspect, in a possible implementation manner of the second aspect, the preset corpus is constructed according to a standard corpus in a target field and an associated corpus of the standard corpus, where the associated corpus of the standard corpus refers to a corpus labeled based on a same character label labeling rule as the standard corpus, the associated corpus belongs to a non-target field, the character label labeling rule refers to a rule for labeling a character label for a text in the corpus, and the character label is used to label a position of each character in a participle of the text.
With reference to the second aspect, in a possible implementation manner of the second aspect, each of the standard corpus and the associated corpus includes a plurality of texts, each of the texts includes a plurality of participles, and characters in the participles are labeled with character tags; the number of the same participles in the standard corpus set and the associated corpus set is larger than a preset number, and the labeled character labels of the characters in the same participle in the standard corpus set and the labeled character labels in the associated corpus set are the same.
With reference to the second aspect, in a possible implementation manner of the second aspect, the label sequence includes a character label corresponding to each character in the text to be participled, and the character label is used to mark the position of the character within a participle; the weight corresponding to the label sequence includes an emission weight and a state transition weight corresponding to each character label in the label sequence.
With reference to the second aspect, in a possible implementation manner of the second aspect, the decoding unit is specifically configured to: determine a first weight sum and a second weight sum corresponding to each label sequence, where the first weight sum is the sum of the emission weights corresponding to the character labels in the label sequence, and the second weight sum is the sum of the state transition weights corresponding to the character labels in the label sequence; determine the score of each label sequence according to its first weight sum, second weight sum, and confusion degree; and determine the label sequence corresponding to the maximum score as the target label sequence.
With reference to the second aspect, in a possible implementation manner of the second aspect, the encoding unit is specifically configured to: and inputting the text to be participled into a coding model trained by utilizing the preset corpus, and outputting the emission weight and the state transition weight corresponding to each character tag in each tag sequence.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor, a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the text segmentation method as provided in the first aspect and any one of its possible designs.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of a server, enable the server to perform a text-tokenization method as provided by the first aspect and any one of its possible designs.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when run on a server, cause the server to perform the text segmentation method as provided by the first aspect and any one of its possible designs.
The technical scheme provided by the embodiments of the disclosure brings at least the following beneficial effects: in the segmentation decoding stage, the confusion degree of a label sequence is obtained according to the probability that each participle in the participle sequence corresponding to that label sequence appears given that its preceding participle appears, so as to evaluate how reasonable the label sequence is; the optimal label sequence, namely the target label sequence, is then selected by combining this with the weight of each label sequence learned in the segmentation encoding stage, thereby ensuring the robustness and accuracy of the segmentation effect in low-resource scenarios. Compared with the related art, the technical scheme of the application can complete word segmentation without a large amount of computing resources and is suitable for industrial scenarios; it does not need to promote the performance of the word segmentation model by stacking models of multiple tasks, but achieves these technical effects by improving the coding algorithm.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of a text segmentation system in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method for text segmentation, according to an exemplary embodiment;
FIG. 3 is a diagram illustrating a coding model structure according to an example embodiment;
FIG. 4 is a block diagram illustrating a text-tokenizing apparatus according to an example embodiment;
FIG. 5 is a schematic diagram illustrating a server architecture, according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In addition, in the description of the embodiments of the present disclosure, unless otherwise specified, "/" indicates "or"; for example, A/B may indicate A or B. "And/or" herein merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, in the description of the embodiments of the present disclosure, "a plurality" means two or more.
Text is a continuous character sequence, and text word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain rules. In English, spaces serve as natural delimiters between words. In Chinese, characters, sentences, and paragraphs can be delimited simply by obvious markers, but words alone have no formal delimiter. Certain rules must therefore be used to identify the boundaries of words and phrases in the text; the process of identifying these boundaries and finally obtaining a word sequence is the text word segmentation process.
In the embodiments of the present disclosure, a word sequence obtained by segmenting a text is referred to as a participle sequence; the participle sequence includes a plurality of words, and the order of the words is consistent with their positions in the text. Illustratively, segmenting the text "this is my favorite song" can yield the participle sequence [this/is/my/favorite/song].
The text word segmentation method provided by the embodiments of the disclosure can be applied to a text word segmentation system, which is used for performing word segmentation on a text to be processed. Fig. 1 is a schematic diagram of a text word segmentation system. As shown in fig. 1, the text word segmentation system includes a text word segmentation apparatus 11 for performing word segmentation on the text to be processed, and a server 12 communicatively connected to the text word segmentation apparatus 11.

The text word segmentation apparatus 11 is configured to execute the text word segmentation method provided in the embodiments of the present disclosure, so as to segment the text to be processed into a word sequence. For example: a plurality of label sequences corresponding to a text to be participled are obtained, where the label sequences are used for segmenting the text into corresponding participle sequences; the text is coded to obtain the weight corresponding to each label sequence; a target label sequence is obtained by decoding according to the weight and the confusion degree corresponding to each label sequence, so that a word segmentation result is determined by using the target label sequence; and the confusion degree corresponding to each label sequence is determined according to the statistical probability that each participle in the participle sequence corresponding to that label sequence appears given that its preceding participle appears.
It should be noted that the characters above refer to the individual characters that form words in the text. It should be understood that, for a word containing multiple characters, each character occupies a different position in the word; these positions include the first character, the middle character(s), and the last character, while a character that forms a word by itself needs no positional distinction. A character label marks the position of a single character within a word. In one exemplary label system, the character labels may be B, M, E, and S, where B, M, and E denote the first, middle, and last characters of a word, respectively, and S denotes a single-character word. For example, if a character in some text is labeled "B", that character is the first character of some word. The text is segmented into a word sequence according to the label marked on each character in the text.
It should be noted that "B, M, E, S" is merely an exemplary label system provided by the embodiments of the present disclosure, and in other embodiments, words in the text may be labeled based on different label systems. In one exemplary label system, the character label may be B, I, where B denotes the first character of the word and I denotes the other (non-first) character of the word. In another exemplary labeling scheme, the character labels may be S, B, M1, M2, M, E, S, where B denotes the first word, M1/M2/M denotes the middle word, E denotes the end word, and S denotes the single word. It should be understood that the adopted tag system can be a standard tag system in the field of natural language processing, and can also be a tag system customized to meet the requirements of a scene.
The text segmentation apparatus 11 can exchange data with the server 12. For example, the text segmentation apparatus 11 may obtain the text to be segmented from the server 12. For another example, the text segmentation apparatus 11 may also send the segmentation result obtained by segmenting the text to the server 12.

The server 12 may be a single server, a server cluster composed of a plurality of servers, or a cloud computing service center, which is not limited in this disclosure. The server 12 is used for collecting texts to be participled, such as texts uploaded by user terminals. The server 12 may also be configured to receive the text segmentation results sent by the text segmentation apparatus 11, and may perform other natural language processing tasks based on them, such as building a thesaurus, calculating semantic similarity between texts, or searching for other texts similar to a given text.

It should be noted that the text segmentation apparatus 11 and the server 12 may be independent devices or may be integrated into the same device, and the present disclosure is not limited in this regard.

In some embodiments, the text word segmentation apparatus 11 may be an electronic device, or may be included in an electronic device, where the electronic device includes, but is not limited to, computer devices such as a mobile phone, a tablet computer, a desktop computer, a notebook computer, a vehicle-mounted terminal, a handheld terminal, an Augmented Reality (AR) device, and a Virtual Reality (VR) device. The embodiments of the present disclosure do not particularly limit the specific form of the electronic device.

When the text word segmentation apparatus 11 and the server 12 are integrated into the same device, the communication between them is communication between internal modules of the device; in this case, the communication flow between the two is the same as when they are independent of each other.

In the following embodiments provided by the present disclosure, the description mainly assumes that the text word segmentation apparatus 11 and the server 12 are set up independently of each other.
The text word segmentation method provided by the embodiments of the present disclosure can also be applied to various natural language processing methods implemented at the back end of various service scenarios, including but not limited to: lexical analysis methods such as new word discovery, part-of-speech tagging, and spelling correction; syntactic analysis methods such as constituency parsing, dependency parsing, and sentence boundary detection; semantic analysis methods such as word sense disambiguation and semantic role labeling; and information extraction methods such as named entity recognition, entity disambiguation, sentiment analysis, and intent recognition. The text word segmentation method provided by the embodiments of the disclosure can serve as a basic step of such natural language processing methods. For example, in part-of-speech tagging, the text word segmentation method provided by the embodiments of the present disclosure is used to segment the text to be tagged into a participle sequence, and then part-of-speech tagging rules are applied to the words in the sequence. For another example, in named entity recognition, the text word segmentation method is used to segment the text to be processed into a participle sequence, and then entity recognition rules are used to recognize named entities in the sequence.
In some embodiments, a training corpus of a certain scale is used to train an initial word segmentation model based on a neural network model, and model parameters are continuously optimized, so that the model has the capability of learning the feature knowledge of a text and segmenting the text according to the learned features. After the word segmentation model is trained, the text to be segmented is input into the word segmentation model, so that a corresponding word segmentation sequence is obtained by utilizing the word segmentation model. The text word segmentation process can be divided into a word segmentation encoding stage and a word segmentation decoding stage, the word segmentation encoding stage can be understood as a stage of extracting features from a text to be word segmented by using an encoding layer of a model, and the word segmentation decoding stage can be understood as a stage of processing the extracted features by using a decoding algorithm to determine an optimal word segmentation result.
As mentioned above, since text segmentation is a basic task in the field of natural language processing, and is used as a basis for other natural language processing tasks, it is required that a segmentation model needs to maintain good robustness in any scenario. However, for low-resource scenes lacking training data, the word segmentation model usually cannot predict unknown words well, resulting in inaccurate word segmentation results.
For low-resource scenarios, the related art trains the word segmentation task in a specific scenario starting from a pre-trained model, so that less corpus data is needed for training; alternatively, it stacks models of multiple tasks to make up for the deficiencies of a single word segmentation model and to promote its performance. It can be seen that the foregoing solutions all improve on the participle encoding stage. However, a pre-trained model is often a complex model of huge size, which means that applying it to an actual scenario requires a large amount of computing resources, limiting the applicable scenarios; for example, it is not applicable to industrial scenarios with insufficient computing resources. The multitask model essentially improves the robustness of the word segmentation model and the accuracy of its results by learning knowledge common to different standard data sets as well as knowledge unique to a single data set, and aggregating the two; this means that a strong feature extraction layer is needed to balance the two kinds of information, the training requirements are high, and the resulting model is difficult to bring up to the expected target.
The embodiments of the disclosure provide a text word segmentation method that introduces, in the participle decoding stage, confusion-degree information for the participle sequence corresponding to each label sequence. The confusion degree is an index for evaluating how reasonable a participle sequence is, so it enriches the basis of participle decoding. A better segmentation effect can thus be achieved without a large amount of computing resources, which suits industrial scenarios, and the performance of the word segmentation model is promoted without stacking models of multiple tasks, so no additional training difficulty is introduced.
Fig. 2 is a flowchart illustrating a text word segmentation method according to an exemplary embodiment of the present disclosure, where as shown in fig. 2, the method may include:
s201, obtaining a plurality of label sequences corresponding to the text to be participled, wherein the label sequences are used for segmenting the text to be participled into corresponding word segmentation sequences.
In the embodiments of the present disclosure, the text to be segmented may be a sentence, a passage of text, etc. Characters in the text to be segmented refer to the single characters that can form words. The embodiments of the disclosure do not limit the number of characters in the text.
A label sequence consists of a plurality of character labels in a specific order; the number of character labels in the label sequence is the same as the number of characters in the text to be segmented, and the character labels correspond one-to-one with the characters contained in the text. Given a label sequence, a segmentation result, i.e. a participle sequence, can be obtained.
In fact, according to a preset label system, the text to be segmented corresponds to a label matrix, and the label matrix includes a plurality of label sequences. For example, taking the text to be segmented 我喜欢听音乐 ("I like listening to music") as an example, the corresponding label matrix may be:
    B B B B B B
    M M M M M M
    E E E E E E
    S S S S S S

(each column corresponds to one character of the text, and each row to one candidate label)
Without constraints, 4^6 label sequences can be obtained based on this label matrix, e.g. [B B B B B B], [B M B M B B], etc., which are not enumerated here. Different label sequences mean different segmentation results; e.g., [S S S S S S] corresponds to the participle sequence [我/喜/欢/听/音/乐], while [S S S S B E] corresponds to [我/喜/欢/听/音乐].

As is readily understood, the 4^6 label sequences include obviously unreasonable ones, e.g. [B B B B B B]. Each label sequence is evaluated based on a specific algorithm to obtain its evaluation score, and the optimal label sequence is finally obtained, so as to obtain the most reasonable participle sequence.
It should be noted that the label matrix corresponding to the same text to be participled differs according to the label system used; the above example shows the label matrix when "B, M, E, S" is used. When "B, I" is used, the label matrix corresponding to the text to be segmented 我喜欢听音乐 is:
    B B B B B B
    I I I I I I
Without constraints, 2^6 label sequences can be obtained based on this label matrix, which are not listed here. The participle sequences obtained from different label sequences also differ.
In S201, the obtained label sequences corresponding to the text to be participled may be all label sequences included in the label matrix corresponding to the text, or may be those label sequences in the label matrix whose reasonableness satisfies a certain condition. For example, the label matrix is processed based on a certain filtering rule to filter out unreasonable label sequences, so as to obtain label sequences whose rationality satisfies the condition, as sketched below.
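The following sketch enumerates the unconstrained candidates for a 6-character text and applies one assumed well-formedness rule; the patent does not specify its filtering rule, so the constraints here are illustrative only:

```python
from itertools import product

LABELS = ("B", "M", "E", "S")

def candidate_tag_sequences(text, labels=LABELS):
    """Enumerate every unconstrained label sequence for a text:
    len(labels) ** len(text) candidates (4**6 = 4096 for 6 characters)."""
    return product(labels, repeat=len(text))

def is_well_formed(tags):
    """An assumed BMES filtering rule: B/M must be followed by M or E,
    E/S by B or S; the sequence must start with B/S and end with E/S."""
    valid_next = {"B": {"M", "E"}, "M": {"M", "E"},
                  "E": {"B", "S"}, "S": {"B", "S"}}
    if tags[0] not in ("B", "S") or tags[-1] not in ("E", "S"):
        return False
    return all(b in valid_next[a] for a, b in zip(tags, tags[1:]))

text = "我喜欢听音乐"
plausible = [t for t in candidate_tag_sequences(text) if is_well_formed(t)]
print(len(plausible))  # far fewer than 4096 after filtering
```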
S202, coding the text to be word-segmented to obtain the weight corresponding to each label sequence.
In some embodiments, the text to be participled is input into a coding model trained with a preset corpus set, which outputs an emission weight matrix and a state transition weight matrix; both have the same dimensions as the label matrix corresponding to the text to be participled. The emission weight matrix includes the emission weight of each character label for each character, i.e. the emission weight corresponding to each character label in each label sequence. The state transition weight matrix includes the probability of transitioning from one character label to another, i.e. the state transition weight corresponding to each character label in each label sequence, which is the probability of that character label transitioning to the next character label.
It can be seen that, in some implementation scenarios, the text to be segmented is processed by the coding model, which outputs the emission weight and the state transition weight corresponding to each character label in each label sequence. In this way, without changing the encoding model, the word segmentation decoding stage combines the confusion degree corresponding to a label sequence with the emission weights and state transition weights learned in the encoding stage to determine the optimal label sequence, thereby guaranteeing the robustness and accuracy of the segmentation effect in low-resource scenarios.
In some possible implementations, the character vector of each character in the text to be participled is first obtained, yielding a character matrix corresponding to the text that contains the character vector of every character; then a preset coding model extracts feature vectors from the character matrix and outputs the emission weight matrix according to the extracted feature vectors.
In a specific implementation, balanced corpora related to a specific scenario or field can be collected in advance and preprocessed to filter out useless data, low-frequency characters, and meaningless characters, yielding training data. A preset model, which may be a Skip-gram model, is trained with the training data to obtain a character vector model. Finally, a mapping dictionary of character vectors is generated from the character vector model; the dictionary contains the mapping between characters and character vectors. When the character vector of each character in the text to be segmented is needed, the mapping dictionary is obtained first, and the character vector of each character is looked up from it.
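A minimal sketch of this character-vector step, assuming the Skip-gram implementation in gensim (the patent names no library, and all hyperparameters here are illustrative):

```python
from gensim.models import Word2Vec  # assumed dependency, not named in the patent

# Each training sample is a list of single characters, matching the
# character-level Skip-gram training described above.
corpus = [list("我喜欢听音乐"), list("我喜欢唱歌")]

model = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)

# Mapping dictionary from character to character vector.
char_vectors = {ch: model.wv[ch] for ch in model.wv.index_to_key}
print(char_vectors["我"].shape)  # (100,)
```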
Referring to fig. 3, in some possible implementations, the preset coding model may include: a Convolutional Neural Network (CNN), a BiLSTM composed of two Long Short-Term Memory (LSTM) networks with opposite time directions, and an output layer.
The feature vectors of the character matrix can be obtained through the Convolutional Neural Network (CNN). A CNN is a feedforward neural network whose artificial neurons respond to surrounding units within a local coverage range; it can be applied to the field of natural language processing, realizes local connectivity and weight sharing, and can effectively extract features. The CNN includes convolutional layers and pooling layers. A convolutional layer is a feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and local features are extracted. Once a local feature is extracted, its positional relationship to other features is also determined. A pooling layer is a feature mapping layer: each computing layer of the network consists of a plurality of feature maps, each feature map is a plane, and the weights of all neurons on the plane are equal. The feature mapping structure adopts a Sigmoid function as the CNN's activation function, so that the feature maps have shift invariance. In addition, since the neurons on one mapping plane share weights, the number of free parameters of the network is reduced.
When the emission weight matrix is generated from the feature vectors, the feature vectors can be input to the two LSTMs respectively; the output vectors generated by the two LSTMs at each time step within a preset period are obtained and spliced into concatenated vectors, which are passed to the output layer, and the vectors output by the output layer are assembled into the emission weight matrix. The LSTM is an extension of the Recurrent Neural Network (RNN); its basic unit realizes an information memory function and controls the memorization, forgetting, and output of historical information through three gate structures (input gate, forget gate, and output gate), so the LSTM has a long-term memory capability and handles long-distance dependencies well.
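A minimal PyTorch sketch of an encoder with this shape (CNN, then BiLSTM, then an output layer); the sizes and the jointly learned transition parameter are assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class CnnBiLstmEncoder(nn.Module):
    """CNN feature extraction, a BiLSTM over the CNN features, and an output
    layer that emits per-character label weights (the emission matrix)."""

    def __init__(self, embed_dim=100, conv_channels=128, hidden=256, num_labels=4):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(conv_channels, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)
        # Label-to-label transition weights, learned jointly (CRF-style).
        self.transitions = nn.Parameter(torch.zeros(num_labels, num_labels))

    def forward(self, char_vecs):  # char_vecs: (batch, seq_len, embed_dim)
        x = self.conv(char_vecs.transpose(1, 2)).transpose(1, 2)
        x, _ = self.bilstm(torch.relu(x))
        return self.out(x), self.transitions

emissions, transitions = CnnBiLstmEncoder()(torch.randn(1, 6, 100))
print(emissions.shape, transitions.shape)  # (1, 6, 4) and (4, 4)
```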
The transition probability between character labels refers to the probability that one character label appears immediately after another, i.e. the probability of transitioning from the other character label to that one.
In some embodiments, the transition probability between character labels may be computed as the number of times the two character labels appear consecutively in the data set divided by the total number of times the preceding character label appears in the data set. For example, each character in each text of the data set is labeled with a character label, so the total number of occurrences of each character label and the number of co-occurrences of any ordered pair of character labels can be counted. Assuming the total number of occurrences of label B in the data set is 100, and label B is immediately followed by label I 60 times, the probability of transitioning from label B to label I is 0.6. Furthermore, a state transition weight matrix corresponding to the label matrix can be obtained from the data set; it includes the transition probabilities between the character labels in each label sequence.
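A minimal sketch of this counting estimate, matching the B-to-I example above (function name illustrative):

```python
from collections import Counter

def transition_probabilities(tagged_texts):
    """Estimate P(next_tag | tag) from a labeled data set.

    tagged_texts: iterable of tag sequences, e.g. [["B", "E", "S"], ...].
    Returns {(tag, next_tag): count(tag, next_tag) / count(tag as predecessor)}.
    """
    tag_counts, pair_counts = Counter(), Counter()
    for tags in tagged_texts:
        tag_counts.update(tags[:-1])  # only tags that have a successor
        pair_counts.update(zip(tags, tags[1:]))
    return {pair: n / tag_counts[pair[0]] for pair, n in pair_counts.items()}

probs = transition_probabilities([["B", "I", "I", "B", "I"], ["B", "I"]])
print(probs[("B", "I")])  # 1.0 in this toy data set
```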
In some implementations, the predetermined coding model outputs the state transition weight matrix at the same time as outputting the transmit weight matrix.
S203, decoding to obtain a target label sequence according to the weight and the confusion degree corresponding to each label sequence, so as to determine a word segmentation result by using the target label sequence; where the confusion degree corresponding to each label sequence is determined according to the statistical probability that each participle in the participle sequence corresponding to the label sequence appears given that its preceding participle appears.
Based on the Markov assumption that the probability of the current word depends on the previous word, for a given participle sequence, the probability of each participle other than the first depends on its preceding participle. If the probability of the k-th participle determined based on the (k-1)-th participle is higher, the rationality of the k-th participle is higher; conversely, if that probability is lower, its rationality is lower. Furthermore, based on the rationality of each word in the participle sequence, the rationality of the whole sequence can be obtained, and the rationality of the participle sequence equals that of the label sequence.
Based on this, the embodiments of the disclosure obtain the confusion degree of each candidate label sequence from its corresponding participle sequence, judge the rationality of the label sequence based on this confusion degree, and determine the optimal label sequence in combination with the emission weight and state transition weight corresponding to each character label in the sequence, thereby enriching the basis of participle decoding and solving the problem of poor robustness and accuracy of the model's segmentation effect caused by the lack of training data in low-resource scenarios.
In some possible implementations, S203 may specifically include: for the participle sequence corresponding to each label sequence, determining the statistical probability that the first participle in the sequence appears and the conditional statistical probability of each remaining participle, where the conditional statistical probability of the k-th participle refers to the statistical probability that the k-th participle appears given that the (k-1)-th participle appears, the k-th participle is any one of the remaining participles, and k is a positive integer greater than 1; determining the confusion degree of the label sequence according to the statistical probability of the first participle and the conditional statistical probabilities of the remaining participles; and determining the target label sequence according to the weight and the confusion degree corresponding to each label sequence.
Illustratively, the statistical probability that each participle in the participle sequence appears given its preceding participle may be represented as P(w_k \mid w_{k-1}), where w_{k-1} denotes the (k-1)-th participle and w_k the k-th participle in the participle sequence, k \in [2, n], and n denotes the number of participles in the participle sequence.
In a possible implementation manner, the confusion degree (perplexity) of the participle sequence, which is also the confusion degree of the label sequence, may be computed from the product of the statistical probability of the 1st participle and the conditional statistical probabilities of the 2nd through n-th participles, as shown in the following Equation 1:

PP(S) = \left( P(w_1) \prod_{k=2}^{n} P(w_k \mid w_{k-1}) \right)^{-1/n}    (Equation 1)

where PP(S) denotes the confusion degree of the label sequence; P(w_1) denotes the statistical probability that the first participle in the participle sequence corresponding to the label sequence appears; and P(w_k \mid w_{k-1}) denotes the conditional statistical probability of the k-th participle in the participle sequence.
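A minimal sketch of Equation 1, computing the perplexity of one participle sequence from its probability chain (nonzero probabilities are assumed):

```python
import math

def perplexity(word_probs):
    """Perplexity of a segmentation sequence (Equation 1).

    word_probs: [P(w_1), P(w_2|w_1), ..., P(w_n|w_{n-1})].
    A lower perplexity means a more reasonable segmentation.
    """
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)  # = (product of probabilities) ** (-1/n)

print(perplexity([0.1, 0.5, 0.8]))  # lower than, e.g., perplexity([0.1, 0.01, 0.01])
```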
In some implementation manners, for the participle sequence corresponding to each label sequence, the probability that the first participle appears in a preset corpus set can be determined as the statistical probability of the first participle; the probability that each pair of adjacent participles appears consecutively in the preset corpus set and the probability that each participle appears in the preset corpus set are determined; and the conditional statistical probability of the k-th participle is determined according to the probability that the k-th and (k-1)-th participles appear consecutively and the probability that the (k-1)-th participle appears. In the embodiments of the disclosure, the probability that the first participle appears in the preset corpus set and the conditional statistical probabilities of the remaining participles in the preset corpus set are introduced into the participle decoding stage to obtain confusion-degree information for evaluating how reasonable the participle sequence is, providing a new basis for the decoding algorithm; the robustness and accuracy of the model's segmentation effect in low-resource scenarios can thus be improved without changing the coding model.
Illustratively, the conditional statistical probability of the k-th participle may be determined by the following Equation 2:

P(w_k \mid w_{k-1}) = \frac{p(w_{k-1}, w_k)}{p(w_{k-1})}    (Equation 2)

where p(w_{k-1}, w_k) denotes the probability that the k-th participle appears immediately after the (k-1)-th participle in the preset corpus set, i.e. the probability that the two appear consecutively; and p(w_{k-1}) denotes the probability that the (k-1)-th participle appears in the preset corpus set.
In a possible implementation manner, a preset word bank can be built from the preset corpus set; it contains a large number of words, the occurrence probability of each word, and the probability of each word appearing immediately adjacent to other words, all determined from the preset corpus set. The occurrence probability of each word is determined from its frequency in the word bank and the total number of words in the word bank, and the probability of a word appearing immediately adjacent to another word is determined from the number of times the pair appears consecutively and the total number of words in the word bank.
In a possible implementation manner, the probability of each word appearing on its own and the probability of each word appearing immediately adjacent to other words can be determined from one or more preset word tables. For example, a first word table is constructed in advance to record the correspondence between each word and its occurrence probability, and a second word table is constructed to record the probability of each word appearing consecutively with other words; when the occurrence probability of a word, or of a word pair, is needed, the first and second word tables are queried.
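A minimal sketch of the two word tables as bigram statistics over a segmented corpus (the class name, method names, and smoothing epsilon are assumptions):

```python
from collections import Counter

class BigramStats:
    """Unigram probabilities (first word table) and adjacent-pair
    probabilities (second word table), estimated from a segmented corpus."""

    def __init__(self, segmented_corpus):
        self.unigrams, self.bigrams = Counter(), Counter()
        total = 0
        for words in segmented_corpus:
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
            total += len(words)
        self.total = total

    def p(self, word):                       # first word table lookup
        return self.unigrams[word] / self.total

    def p_cond(self, prev, word, eps=1e-8):  # Equation 2: P(w_k | w_{k-1})
        joint = self.bigrams[(prev, word)] / self.total
        return joint / max(self.p(prev), eps)

stats = BigramStats([["我", "喜欢", "听", "音乐"], ["我", "喜欢", "唱歌"]])
print(stats.p("我"), stats.p_cond("我", "喜欢"))
```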
In the embodiments of the present disclosure, the preset corpus set is constructed in advance from a standard corpus set of the target field. Taking the search field as the target field as an example, the standard corpora of the target field may be labeled search texts, which constitute the standard corpus set. These search texts may originate from any search platform, which is not limited here.
Considering that the standard corpora may be insufficient in some specific fields, and to ensure that the amount of data in the preset corpus set is sufficient, in other possible implementation manners the preset corpus set may be constructed from the standard corpus set of the target field together with an associated corpus set of the standard corpus set. The associated corpus set is a corpus set labeled based on the same character-label labeling rule as the standard corpus set, and it belongs to a non-target field. For example, if the target field is the search field and its standard corpora are insufficient, corpora from the machine question-answering field may be selected as the associated corpora of the search field to expand the preset corpus set. That is, corpora labeled under the same character-label labeling rule as the standard corpus set are selected from other (non-target) fields to serve as associated corpora, and the preset corpus set is generated from the standard and associated corpus sets; the associated corpora from non-target fields thus expand the preset corpus set of the low-resource target field and increase the amount of word data in it.
Illustratively, if "三好学生" is labeled as "三|S 好|S 学|B 生|E" in one corpus set and as "三|B 好|E 学|B 生|E" in another, the labeling rules of the two corpus sets are considered different.
It is easy to understand that the standard corpus set of the target field and a labeled corpus set of a non-target field each include a plurality of words labeled with character labels. Based on this, whether the two corpus sets are labeled under the same rule can be judged by whether the same vocabulary items carry the same character labels in both sets and whether the number of such items exceeds a preset number. For example, if the number of shared vocabulary items between the standard corpus set and a non-target-field corpus set is greater than the preset number, and those items carry the same character labels in both sets, the latter is determined to be an associated corpus set of the standard corpus set. For another example, traverse the vocabulary in the standard corpus set; for each item, determine whether it is included in the non-target-field corpus set, and if so, further determine whether its character labels are the same in both sets; if they are, mark it as a target vocabulary item. If the number of target vocabulary items is greater than the preset number, the non-target-field corpus set is determined to be an associated corpus set of the standard corpus set.
It can be seen that, considering that the standard corpus set generated from target-field corpora may be insufficient in scale in low-resource scenarios, corpora labeled under the same character-label labeling rule as the standard corpus set are selected from other (non-target) fields as associated corpora. When the number of shared vocabulary items between a non-target-field corpus set and the standard corpus set is greater than the preset number and those items carry the same labels in both sets, the two are considered to be labeled under the same rule, and the non-target-field corpus set is determined to be an associated corpus set of the standard corpus set. Finally, the preset corpus set is generated from the standard and associated corpus sets, increasing the amount of word data in the preset corpus set; a minimal sketch of this check follows.
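The sketch below assumes each corpus set is summarized as a mapping from word to its character-label tuple; this representation and the threshold value are illustrative, not from the patent:

```python
def is_associated_corpus(standard, candidate, min_shared=1000):
    """Decide whether a candidate (non-target-field) corpus set can serve as
    the associated corpus set of the standard corpus set.

    standard/candidate: dict mapping word -> its character-label tuple,
    e.g. {"学生": ("B", "E")}. min_shared is the preset number from the text.
    """
    shared = [w for w in standard
              if w in candidate and standard[w] == candidate[w]]
    return len(shared) > min_shared

std = {"学生": ("B", "E"), "音乐": ("B", "E")}
cand = {"学生": ("B", "E"), "音乐": ("B", "E"), "问答": ("B", "E")}
print(is_associated_corpus(std, cand, min_shared=1))  # True
```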
In some embodiments, the preset coding model may be a coding model obtained by training based on the preset corpus, and further, for a low-resource scene, the feature extraction capability of the coding model may be improved without changing the structure of the coding model.
In some possible implementations of S203, a first weight sum and a second weight sum corresponding to each label sequence are first determined, where the first weight sum is the sum of the emission weights corresponding to the character labels in the label sequence, and the second weight sum is the sum of their state transition weights; then the score of each label sequence is determined according to its first weight sum, second weight sum, and confusion degree; finally, the label sequence with the maximum score is determined as the target label sequence.
Illustratively, the score of a label sequence may be determined according to the following Equation 3:

S_i = \sum_j e_j + \sum_j t_j - \log PP(S_i)    (Equation 3)

where S_i denotes the score of the i-th label sequence; e_j denotes the emission weight corresponding to the j-th character label in the i-th label sequence; t_j denotes the state transition weight corresponding to the j-th character label in the i-th label sequence; and PP(S_i) denotes the confusion degree of the i-th label sequence.
In some possible implementations, the score for each tag sequence may be determined according to equation 4 below:
$$S_i = \frac{W_e \sum_j e_j + W_t \sum_j t_j}{W_p \, PP(S_i)} \tag{4}$$

where $W_e$, $W_t$ and $W_p$ are preset weighting coefficients; as with formula 3, the exact form is a reconstruction from the surrounding description.
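Under the weighted reading of formula 4 reconstructed above, a minimal scoring and selection sketch might look like this (the coefficient values and function names are placeholders, and the formula itself is an assumption as noted):

```python
def sequence_score(emission, transition, perplexity, w_e=1.0, w_t=1.0, w_p=1.0):
    """Score one tag sequence from its per-tag weights and its perplexity.

    emission / transition: lists of e_j and t_j for every character tag in
    the sequence; perplexity: PP(S_i) of the corresponding participle
    sequence. A lower perplexity raises the score.
    """
    return (w_e * sum(emission) + w_t * sum(transition)) / (w_p * perplexity)


def pick_target_sequence(candidates):
    """candidates: iterable of (tag_sequence, emission, transition, perplexity);
    returns the tag sequence with the maximum score."""
    best = max(candidates, key=lambda c: sequence_score(c[1], c[2], c[3]))
    return best[0]
```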
As can be seen from the above formulas 3 and 4, the text word segmentation method provided in the embodiments of the present disclosure selects the optimal tag sequence by combining the emission weight and state transition weight that the coding model learns for each tag in a tag sequence with the confusion degree corresponding to that tag sequence, thereby ensuring the robustness and accuracy of the word segmentation effect in low-resource scenes. Compared with the related art, the technical solution of the present application can complete word segmentation without a large amount of computing resources and is therefore suitable for industrial scenes; it also does not need to promote the word segmentation performance of the word segmentation model by stacking models of multiple tasks, but achieves the above technical effects by improving the coding algorithm.
In specific implementation, the Viterbi algorithm can be used to combine the emission weight and state transition weight corresponding to each character tag in each tag sequence with the corresponding $P(w_1)$ and $P(w_k \mid w_{k-1})$ of the participle sequence corresponding to that tag sequence, so as to determine the target tag sequence from all the tag sequences. Based on the embodiments of the present disclosure, it is clear to those skilled in the art how to apply the Viterbi algorithm to these embodiments; an illustrative sketch is nonetheless given below.
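A standard Viterbi pass over the character-tag lattice might look as follows. Since the confusion degree is a property of a whole participle sequence rather than of individual transitions, one option (an assumption, not a detail fixed by the embodiment) is to extract the n best paths in this way and then rescore them with formula 4:

```python
def viterbi(emit, trans, labels):
    """Return the highest-scoring tag path.

    emit: list over characters of {tag: emission weight};
    trans: {(prev_tag, tag): state transition weight};
    labels: the tag set (e.g. B/M/E/S).
    """
    best = {l: emit[0][l] for l in labels}   # scores after the first character
    back = []                                # one backpointer dict per step
    for scores in emit[1:]:
        ptrs, nxt = {}, {}
        for l in labels:
            prev = max(labels, key=lambda p: best[p] + trans[(p, l)])
            ptrs[l] = prev
            nxt[l] = best[prev] + trans[(prev, l)] + scores[l]
        back.append(ptrs)
        best = nxt
    last = max(labels, key=best.get)
    path = [last]
    for ptrs in reversed(back):              # follow backpointers to the start
        path.append(ptrs[path[-1]])
    return path[::-1]
```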
A low-resource scene is simulated, and the text word segmentation method provided by the embodiment of the present disclosure is verified on a plurality of data sets; the results are shown in the table below. The compared models include pre-trained models trained with Knowledge Distillation (KD) combined with a Softmax activation function or Conditional Random Fields (CRF), and the data sets include the AS and CITYU traditional Chinese data sets, the PKU and MSR simplified Chinese data sets, the CTB data set, the SXU data set, a microblog data set, and the ZX data set.
[Table: word segmentation results of the compared models on the above data sets under different data set utilization rates; the table image is not reproduced here.]
As can be seen from the above table, the text word segmentation method provided by the embodiment of the present disclosure shows better robustness and accuracy, especially in low-resource scenes, such as the simulated low-resource scenes with data set utilization rates of 10% and 80%.
As can be seen from the above embodiments, in the text word segmentation method provided by the embodiments of the present disclosure, in the word segmentation decoding stage, the confusion degree of a tag sequence is obtained according to the probability that each participle in the participle sequence corresponding to that tag sequence appears under the condition that its preceding participle appears, so as to evaluate how reasonable the tag sequence is; the optimal tag sequence is then selected in combination with the weight of each tag sequence learned in the word segmentation encoding stage, thereby ensuring the robustness and accuracy of the word segmentation effect in low-resource scenes. Compared with the related art, the technical solution of the present application can complete word segmentation without a large amount of computing resources and is therefore suitable for industrial scenes; it does not need to promote the word segmentation performance of the word segmentation model by stacking models of multiple tasks, but achieves the above technical effects by improving the coding algorithm.
Fig. 4 is a block diagram illustrating a text word segmentation apparatus according to an exemplary embodiment. As shown in fig. 4, the text word segmentation apparatus provided in the embodiment of the present disclosure includes a tag sequence obtaining unit 401, an encoding unit 402, and a decoding unit 403.

A tag sequence obtaining unit 401, configured to obtain multiple tag sequences corresponding to a text to be word segmented, where the tag sequences are used to segment the text to be word segmented into corresponding participle sequences. For example, as shown in fig. 2, the tag sequence obtaining unit 401 may be configured to execute S201.

The encoding unit 402 is configured to perform encoding processing on the text to be word segmented to obtain a weight corresponding to each tag sequence. For example, as shown in fig. 2, the encoding unit 402 may be configured to perform S202.
A decoding unit 403, configured to decode to obtain a target tag sequence according to the weight and confusion degree corresponding to each tag sequence, so as to determine a word segmentation result by using the target tag sequence, where the confusion degree corresponding to each tag sequence is determined according to the statistical probability of each participle in the corresponding participle sequence appearing under the condition that its preceding participle appears. For example, as shown in fig. 2, the decoding unit 403 may be configured to perform S203.
In some embodiments, the decoding unit 403 is specifically configured to: for the participle sequence corresponding to each label sequence, determine the statistical probability of the first participle in the participle sequence appearing and the conditional statistical probability of each remaining participle, where the conditional statistical probability of the kth participle refers to the statistical probability of the kth participle appearing under the condition that the (k-1)th participle appears, and k is a positive integer greater than 1; determine the confusion degree of the label sequence according to the statistical probability of the first participle and the conditional statistical probabilities of the remaining participles; and determine the target label sequence according to the weight and confusion degree corresponding to each label sequence.

In some embodiments, the decoding unit 403 is specifically configured to: determine the probability of the first participle appearing in a preset corpus as the statistical probability of the first participle appearing; determine the probability of each pair of adjacent participles appearing consecutively in the preset corpus and the probability of each participle appearing among all participles in the preset corpus; and determine the conditional statistical probability of the kth participle according to the probability of the kth participle and the (k-1)th participle appearing consecutively and the probability of the (k-1)th participle appearing, as sketched below.
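A minimal sketch of these bigram statistics and the resulting confusion degree follows (the count-based estimators and the Nth-root normalization are standard language-model practice assumed here, not formulas quoted from the embodiment; smoothing for unseen participles is omitted):

```python
import math
from collections import Counter

def build_counts(corpus):
    """corpus: list of participle sequences from the preset corpus set."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for words in corpus:
        total += len(words)
        unigrams.update(words)                 # occurrences of each participle
        bigrams.update(zip(words, words[1:]))  # consecutive participle pairs
    return unigrams, bigrams, total

def perplexity(words, unigrams, bigrams, total):
    """PP = (P(w1) * prod_k P(wk | wk-1)) ** (-1/N); counts assumed non-zero."""
    logp = math.log(unigrams[words[0]] / total)
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigrams[(prev, cur)] / unigrams[prev])
    return math.exp(-logp / len(words))
```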
In some embodiments, the preset corpus is constructed from a standard corpus of a target domain and the associated corpora of the standard corpus, where an associated corpus of the standard corpus refers to a corpus labeled under the same character label labeling rule as the standard corpus and belonging to a non-target domain; the character label labeling rule refers to a rule for labeling characters in the texts of a corpus with character labels, and a character label is used for marking the position of each character of a text in the participle to which it belongs.
In some embodiments, the standard corpus and the associated corpus each include a number of texts, each text includes a number of participles, and characters in the participles are labeled with character labels; the number of the same participles in the standard corpus set and the associated corpus set is larger than a preset number, and the labeled character labels of the characters in the same participle in the standard corpus set and the labeled character labels in the associated corpus set are the same.
In some embodiments, the tag sequence includes a character label corresponding to each character in the text to be participled, and the character label is used for marking the position of the character in its participle; the weight corresponding to the tag sequence includes an emission weight and a state transition weight corresponding to each character label in the tag sequence.
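For instance, under a four-tag BMES scheme (B: beginning of a participle; M: middle; E: end; S: single-character participle), which is a common convention assumed here for illustration since the embodiment does not fix a particular label set, a tag sequence is turned into participles as follows:

```python
def labels_to_words(text, labels):
    """Cut text according to per-character position labels (BMES assumed)."""
    words, start = [], 0
    for i, tag in enumerate(labels):
        if tag in ("E", "S"):          # a participle ends at this character
            words.append(text[start:i + 1])
            start = i + 1
    return words

# labels_to_words("我爱机器学习", ["S", "S", "B", "M", "M", "E"])
# returns ["我", "爱", "机器学习"]
```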
In some embodiments, the decoding unit 403 is specifically configured to: determine a first weight sum and a second weight sum corresponding to each label sequence, where the first weight sum is the sum of the emission weights corresponding to the character labels in the label sequence, and the second weight sum is the sum of the state transition weights corresponding to the character labels in the label sequence; determine the score of each label sequence according to the first weight sum, the second weight sum and the confusion degree corresponding to that label sequence; and determine the label sequence corresponding to the maximum score as the target label sequence.
In some embodiments, the encoding unit 402 is specifically configured to: input the text to be participled into a coding model trained with the preset corpus, and output the emission weight and state transition weight corresponding to each character label in each label sequence.
According to the text word segmentation apparatus provided by the embodiment of the present disclosure, in the word segmentation decoding stage, the confusion degree of a label sequence is obtained according to the probability that each participle in the corresponding participle sequence appears under the condition that its preceding participle appears, so as to evaluate how reasonable the label sequence is, and the optimal label sequence is selected in combination with the weight of each label sequence learned in the word segmentation encoding stage, thereby ensuring the robustness and accuracy of the word segmentation effect in low-resource scenes. Compared with the related art, the technical solution of the present application can complete word segmentation without a large amount of computing resources and is therefore suitable for industrial scenes; it does not need to promote the word segmentation performance of the word segmentation model by stacking models of multiple tasks, but achieves the above technical effects by improving the coding algorithm.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
Fig. 5 is a schematic structural diagram of a server provided by the present disclosure. As shown in fig. 5, the server 50 may include at least one processor 501 and a memory 503 for storing processor-executable instructions, where the processor 501 is configured to execute the instructions in the memory 503 to implement the text word segmentation method in the above-described embodiments.
Additionally, the server 50 may also include a communication bus 502 and at least one communication interface 504.

The processor 501 may be a central processing unit (CPU), a micro-processing unit, an ASIC, or one or more integrated circuits for controlling the execution of programs of the present disclosure.

The communication bus 502 may include a path that conveys information between the aforementioned components.

The communication interface 504 may be any device, such as a transceiver, for communicating with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 503 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be self-contained and connected to the processor by a bus, or may be integrated with the processor.

The memory 503 is used for storing the instructions for executing the solution of the present disclosure, and is controlled by the processor 501. The processor 501 is configured to execute the instructions stored in the memory 503 to implement the functions of the method of the present disclosure.
As an example, in conjunction with fig. 4, the functions implemented by the tag sequence obtaining unit 401, the encoding unit 402 and the decoding unit 403 in the text word segmentation apparatus are the same as those of the processor 501 in fig. 5.

In particular implementations, the processor 501 may include one or more CPUs, such as CPU0 and CPU1 in fig. 5, as an example.

In particular implementations, the server 50 may include multiple processors, such as the processor 501 and the processor 507 in fig. 5, as an embodiment. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

In particular implementations, the server 50 may also include an output device 505 and an input device 506, as an embodiment. The output device 505, which is in communication with the processor 501, may display information in a variety of ways. For example, the output device 505 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 506 is in communication with the processor 501 and can accept user input in a variety of ways. For example, the input device 506 may be a mouse, a keyboard, a touch screen device, or a sensing device.

Those skilled in the art will appreciate that the configuration shown in fig. 5 does not constitute a limitation of the server 50, which may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
In addition, the present disclosure also provides a computer-readable storage medium, wherein when the instructions in the computer-readable storage medium are executed by a processor of the server, the server is enabled to execute the text word segmentation method provided in the above embodiment.
In addition, the present disclosure also provides a computer program product comprising computer instructions which, when run on a server, cause the server to perform the text word segmentation method as provided in the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A text word segmentation method is characterized by comprising the following steps:
acquiring a plurality of label sequences corresponding to a text to be participled, wherein the label sequences are used for segmenting the text to be participled into corresponding word segmentation sequences;
coding the text to be word segmented to obtain the weight corresponding to each label sequence;
decoding to obtain a target label sequence according to the weight and the confusion degree corresponding to each label sequence, so as to determine a word segmentation result by using the target label sequence, wherein the confusion degree corresponding to each label sequence is determined according to the statistical probability of each participle in the participle sequence corresponding to the label sequence appearing under the condition that its preceding participle appears.
2. The method of claim 1, wherein the decoding to obtain the target label sequence according to the weight and the confusion degree corresponding to each label sequence comprises:

for the participle sequence corresponding to each label sequence, determining the statistical probability of the first participle in the participle sequence appearing and the conditional statistical probability of each remaining participle, wherein the conditional statistical probability of the kth participle refers to the statistical probability of the kth participle appearing under the condition that the (k-1)th participle appears, and k is a positive integer greater than 1;

determining the confusion degree of the label sequence according to the statistical probability of the first participle and the conditional statistical probabilities of the remaining participles;
and determining a target label sequence according to the corresponding weight and the confusion degree of each label sequence.
3. The method of claim 2, wherein the determining the statistical probability of the occurrence of the first participle and the conditional statistical probability of each remaining participle in the participle sequence comprises:
determining the probability of the first participle appearing in a preset corpus set as the statistical probability of the first participle appearing;
determining the probability of each pair of adjacent participles appearing consecutively in the preset corpus set and the probability of each participle appearing among all participles in the preset corpus set;

and determining the conditional statistical probability of the kth participle according to the probability of the kth participle and the (k-1)th participle appearing consecutively and the probability of the (k-1)th participle appearing.
4. The method according to claim 3, wherein the preset corpus is constructed according to a standard corpus of a target domain and an associated corpus of the standard corpus, the associated corpus of the standard corpus refers to a corpus labeled based on the same character label labeling rule as the standard corpus, the associated corpus belongs to a non-target domain, the character label labeling rule refers to a rule for labeling characters of a text in a corpus with character labels, and a character label is used for marking the position of each character of the text in a participle of the text.
5. The method according to claim 4, wherein the standard corpus and the associated corpus each include a plurality of texts, each text includes a plurality of participles, and characters in the participles are labeled with character labels; the number of the same participles in the standard corpus set and the associated corpus set is larger than a preset number, and the labeled character labels of the characters in the same participle in the standard corpus set and the labeled character labels in the associated corpus set are the same.
6. The text word segmentation method according to claim 1, wherein the label sequence includes a character label corresponding to each character in the text to be participled, and the character label is used for marking the position of the character in its participle; and the weight corresponding to the label sequence includes an emission weight and a state transition weight corresponding to each character label in the label sequence.
7. A text word segmentation apparatus, comprising:

a label sequence acquisition unit, configured to acquire a plurality of label sequences corresponding to a text to be participled, wherein the label sequences are used for segmenting the text to be participled into corresponding participle sequences;

an encoding unit, configured to encode the text to be participled to obtain a weight corresponding to each label sequence; and

a decoding unit, configured to decode to obtain a target label sequence according to the weight and the confusion degree corresponding to each label sequence, so as to determine a word segmentation result by using the target label sequence, wherein the confusion degree corresponding to each label sequence is determined according to the statistical probability of each participle in the participle sequence corresponding to the label sequence appearing under the condition that its preceding participle appears.
8. An electronic device, comprising: a processor, and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the text word segmentation method of any one of claims 1-6.

9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of a server, enable the server to perform the text word segmentation method of any one of claims 1-6.

10. A computer program product, characterized in that the computer program product comprises computer instructions which, when run on a server, cause the server to perform the text word segmentation method of any one of claims 1-6.
CN202111530194.2A | 2021-12-14 | 2021-12-14 | Text word segmentation method, device, equipment and storage medium | Active | CN114218939B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111530194.2A (CN114218939B (en)) | 2021-12-14 | 2021-12-14 | Text word segmentation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111530194.2A | 2021-12-14 | 2021-12-14 | Text word segmentation method, device, equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN114218939A (en) | 2022-03-22
CN114218939B (en) | 2025-06-10

Family

ID=80701983

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111530194.2A (Active, CN114218939B (en)) | Text word segmentation method, device, equipment and storage medium | 2021-12-14 | 2021-12-14

Country Status (1)

Country | Link
CN (1) | CN114218939B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109033085A (en)* | 2018-08-02 | 2018-12-18 | 北京神州泰岳软件股份有限公司 | The segmenting method of Chinese automatic word-cut and Chinese text
US20200311205A1 (en)* | 2019-03-26 | 2020-10-01 | Siemens Aktiengesellschaft | System and method for natural language processing
CN112084334A (en)* | 2020-09-04 | 2020-12-15 | 中国平安财产保险股份有限公司 | Corpus label classification method and device, computer equipment and storage medium
CN113743107A (en)* | 2021-08-30 | 2021-12-03 | 北京字跳网络技术有限公司 | Entity word extraction method and device and electronic equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU DEMIN; CHU CHENGWEI; HU CHEN; HU YUYUAN: "Multilingual text sentiment analysis method fusing attention mechanism under pre-training model", Journal of Chinese Computer Systems, no. 02, 15 February 2020 (2020-02-15) *

Also Published As

Publication number | Publication date
CN114218939B (en) | 2025-06-10

Similar Documents

Publication | Title
US12288027B2 (en) | Text sentence processing method and apparatus, computer device, and storage medium
US12271701B2 (en) | Method and apparatus for training text classification model
CN112560912B (en) | Classification model training methods, devices, electronic equipment and storage media
JP7430820B2 (en) | Sorting model training method and device, electronic equipment, computer readable storage medium, computer program
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments
CN111222305B (en) | Information structuring method and device
CN110597961B (en) | Text category labeling method and device, electronic equipment and storage medium
US12353835B2 (en) | Model training method and method for human-machine interaction
CN112232086A (en) | A semantic recognition method, device, computer equipment and storage medium
CN110941958B (en) | Text category labeling method and device, electronic equipment and storage medium
WO2023137911A1 (en) | Intention classification method and apparatus based on small-sample corpus, and computer device
CN111832312B (en) | Text processing method, device, equipment and storage medium
CN110929532B (en) | Data processing method, device, equipment and storage medium
CN112036186B (en) | Corpus annotation method, device, computer storage medium and electronic device
CN112101042A (en) | Text emotion recognition method and device, terminal device and storage medium
CN113850383B (en) | Text matching model training method and device, electronic equipment and storage medium
CN116304081A (en) | Language model pre-training method, text processing method and related equipment
CN119151012A (en) | Text training sample generation method and device based on large model and electronic equipment
CN117033961A (en) | Multi-mode image-text classification method for context awareness
CN113095063B (en) | Two-stage emotion migration method and system based on shielding language model
CN116450781A (en) | Question and answer processing method and device
CN111090720B (en) | Hot word adding method and device
CN110969005A (en) | Method and device for determining similarity between entity corpora
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion
CN114722827B (en) | Model training method, device and equipment for task processing model and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
