Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; these embodiments are provided so that those skilled in the art will be able to make and use the present disclosure without departing from its spirit and scope.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination," depending on the context.
First, the terms involved in one or more embodiments of the present specification are explained.
Pre-trained model: a model obtained by training on large-scale unlabeled data.
Word list (vocabulary): when a natural-language neural network model processes a sentence, the words are first mapped into mathematical representations, and the word list is used as the lookup table in this process.
In practical application, the word list of a pre-trained model is usually fixed: it cannot be adjusted according to the task data of subsequent project scenarios, and it over-segments proper nouns or domain-specific sentences. As a result, the pre-trained model can hardly understand sentences in actual project scenarios and cannot process them effectively, which restricts its performance. For such scenarios, specific optimization can be performed on the subsequent task data, for example, retraining a new word list on other data or training a new word list on the downstream data. However, this solution consumes resources in every project scenario for tedious word list retraining, lacks generality, and completely discards the initial word list, which harms model performance.
Based on this, the text processing method provided in the embodiments of the present specification implements pluggable word list migration, addressing the problem of subword segmentation differences in language generation models, and also provides a new application process for pre-trained models. When a word in a new word list does not appear in the original word list, a word representation generator can be used to generate a mathematical representation for the new word. The word representation generator comprehensively analyzes the mathematical representations of old words whose form is similar to the new word and calculates the mathematical representation of the new word through a neural network model, so that the desired word list can be used in the downstream project scenario. When the word representation generator is trained, new words are generated by randomly splitting and combining the old words of the pre-trained model, and the parameters of the representation generator are updated by jointly computing the training loss of the pre-trained model and a loss that pulls the representation of the split-and-recombined sentence close to that of the original sentence. This method allows the model to flexibly select a specific word list in the actual project scenario, thereby improving the effect of the model in that scenario.
It should be noted that the pre-training model involved in the text processing method provided in the embodiments of the present specification may be understood as any text processing model, such as a translation model. The following embodiments are described in detail by taking a text translation model as an example.
In the present specification, a text processing method is provided, and the present specification relates to a text processing apparatus, a computing device, a computer-readable storage medium, and a computer program, which are described in detail one by one in the following embodiments.
In order to obtain more model training data and make a text processing model suitable for different project scenarios, the text processing method provided by this embodiment provides a way to generate word representations for target words: corresponding word representations can be generated for different target words by using a preset representation generation rule, so as to build a word list suitable for a project scenario, and this word list can also be applied to the text processing model. Referring to fig. 1, fig. 1 is a flowchart illustrating a text processing method according to an embodiment of the present disclosure, which includes the following steps.
Step 102: an initial word of the initial text is determined in an initial vocabulary.
Here, the initial word list may be understood as the word list built from the initial training data of the pre-training model, and the initial text may be understood as a text sentence to be processed.
It should be noted that, this embodiment is exemplified by detailed processing of the initial text in the initial training data in the pre-training model, and the same processing may be performed on a plurality of texts at the same time according to the following manner, which is not limited in this embodiment.
In practical application, the server may divide an initial text, where the initial text may be understood as a text sentence in a first application scenario of the pre-training model, and determine the initial words corresponding to the divided text sentence in the initial word list of the pre-training model. The division of the text sentence into initial words may be based on a text processing rule of the first application scenario; this embodiment does not limit the specific dividing manner.
It should be noted that the first application scenario may be understood as one application scenario to which the pre-training model can be applied. For example, if the pre-training model is a text translation model and the first application scenario is a scientific translation scenario, the word list of the text translation model in this scenario is biased toward scientific vocabulary.
For example, the first application scenario is a scientific text processing scenario, and the selected initial text is "soda water contains a large amount of carbon dioxide". The pre-training model on the server can divide this initial text, and the resulting initial words are "soda water", "middle", "containing", "large amount", "di", "oxide" and "carbon", respectively.
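As a sketch of the segmentation step above: the following greedy longest-match routine is one plausible way a word list could drive the division of a sentence into initial words; the toy word list, the `max_len` bound, and the single-character fallback are illustrative assumptions, not the method's prescribed rule.

```python
def segment(text: str, vocab: set, max_len: int = 4) -> list:
    """Greedy longest-match segmentation against a word list."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                words.append(piece)
                i += length
                break
    return words

# Toy example: "ab" is in the word list, so it is kept whole.
print(segment("abcd", {"ab"}))  # ['ab', 'c', 'd']
```

Any segmentation rule with the same interface (text in, list of words out) could be substituted here without changing the later steps.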
Step 104: and processing the initial text based on a preset processing rule to obtain candidate words of the processed initial text.
The preset processing rule may be understood as a rule for dividing and/or merging the words of the initial text; different application scenarios may be adapted to different processing rules, which is not limited in this embodiment.
In practical application, in order to adapt to different project application scenarios, the initial text may be processed again according to the preset processing rule to obtain candidate words of the processed initial text. This makes it convenient to subsequently screen out, from the candidate words, the words suitable for another project application scenario, and thus to build a new word list suitable for that scenario.
Specifically, the processing the initial text based on the preset processing rule to obtain the candidate words of the processed initial text includes:
splitting and/or merging the words in the initial text based on a preset processing rule to obtain candidate words of the initial text.
In practical application, in order to adapt to a second application scenario of the pre-training model, which may be, for example, a daily-life translation scenario, the words in the initial text are split and/or merged based on a preset processing rule of the second application scenario. Splitting may be understood as dividing a word into its components: for example, the word "soda water" may be split into "steam" and "water". Merging may be understood as combining two adjacent words: for example, the two words "di" and "oxide" are merged into "dioxide". The words obtained after splitting and/or merging are used as the candidate words of the initial text.
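A minimal sketch of the split-and/or-merge processing described above, assuming the rule is given as two hypothetical lookup tables (`split_map`, `merge_map`); the actual preset processing rule may take any form.

```python
def apply_rules(words, split_map, merge_map):
    """Split words per split_map, then merge adjacent pairs per merge_map."""
    # First split any word the rule table says to split.
    out = []
    for w in words:
        out.extend(split_map.get(w, [w]))
    # Then merge adjacent pairs listed in the merge table.
    merged, i = [], 0
    while i < len(out):
        if i + 1 < len(out) and (out[i], out[i + 1]) in merge_map:
            merged.append(merge_map[(out[i], out[i + 1])])
            i += 2
        else:
            merged.append(out[i])
            i += 1
    return merged

words = ["soda water", "di", "oxide", "carbon"]
split_map = {"soda water": ["steam", "water"]}
merge_map = {("di", "oxide"): "dioxide"}
print(apply_rules(words, split_map, merge_map))
# ['steam', 'water', 'dioxide', 'carbon']
```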
In the text processing method provided by the embodiment of the specification, the initial text is processed again through the preset processing rule to obtain the candidate words, so that words different from the initial words can subsequently be screened out from the candidate words for the second application scenario, and the word representations of those words can then be obtained.
Step 106: and comparing the initial word with the candidate word, and taking the candidate word which is not matched with the initial word as a target word.
In order to determine which words required for the initial text in the second application scenario differ from those of the first application scenario, in practical applications the initial words of the first application scenario are compared with the candidate words of the second application scenario. The candidate words that do not match any initial word are screened out as target words, and the word representations of the target words are then determined, so that they can subsequently be added to the initial word list, making the initial word list suitable for both the first application scenario and the second application scenario.
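The comparison step above can be sketched as a simple set difference; the word lists below reuse the running example and are illustrative only.

```python
def target_words(initial_words, candidate_words):
    """Candidate words with no match in the initial word list become targets."""
    initial = set(initial_words)
    return [w for w in candidate_words if w not in initial]

initial = ["soda water", "middle", "containing", "large amount",
           "di", "oxide", "carbon"]
candidates = ["steam", "water", "middle", "containing", "large amount",
              "dioxide", "carbon"]
print(target_words(initial, candidates))  # ['steam', 'water', 'dioxide']
```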
Step 108: and calculating the word characteristics of the target words based on a preset characteristic generation rule.
The preset representation generation rule may be understood as a rule for calculating the word representation of a word; it may be a representation generator or a representation calculation formula, which is not limited in this embodiment. The following embodiments provide three ways of calculating the word representation of a target word, but the embodiments of the present disclosure are not limited to these three calculation methods.
In order to determine an accurate word representation of a target word, the target words can be divided into two cases: a target word that can be further split is called a first target word, and a target word that cannot be split is called a second target word. For these two cases, the word representations of the different target words can be calculated from the corresponding representation mean values. Specifically, before the calculating the word representation of the target word based on the preset representation generation rule, the method further includes:
determining a word representation of each participle in a first target word, and determining a first representation mean value based on the word representation of each participle; and/or
Determining a word representation of the initial word having an association relationship with the second target word, and determining a second representation mean based on the word representation of the initial word having an association relationship with the second target word.
The first target word can be understood as a target word which can be further split into participles, for example, a target word of "soda water", and two participles, namely "steam" and "water", can be split.
The second target word can be understood as a target word which cannot be continuously split into participles, for example, the target word of "water", and the participles cannot be continuously split.
In specific implementation, the pre-training model can determine the word representation of each participle in the first target word, and then determine the average of the word representations of the participles as the first representation mean value; in practical application, the word representation of the first target word can be determined based on this average. Because the second target word cannot be split into participles, the pre-training model instead determines the word representations of the initial words that have an association relationship with the second target word, and then determines the second representation mean value based on those word representations.
For example, if the first target word is "soda water", the pre-training model may determine the word representations of the participles "steam" and "water", which may be denoted as h'(steam) and h'(water), respectively, so that the first representation mean value h'(soda water) is the average of h'(steam) and h'(water). If the second target word is "water", it may be determined from the initial word list that the initial words associated with "water" include "ice water", "boiled water" and "in water"; their word representations may be denoted as h'(ice water), h'(boiled water) and h'(in water), so that the second representation mean value h'(water) is the average of h'(ice water), h'(boiled water) and h'(in water).
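The two representation means can be sketched as plain vector averages; the two-dimensional toy vectors below stand in for real word representations from the pre-trained model's embedding table.

```python
def mean_vec(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

# First mean: average over the participles of a splittable target word.
h_steam, h_water = [1.0, 3.0], [3.0, 5.0]
print(mean_vec([h_steam, h_water]))  # [2.0, 4.0]

# Second mean: average over initial words related to an unsplittable target.
h_ice_water, h_boiled_water, h_in_water = [0.0, 6.0], [2.0, 2.0], [4.0, 1.0]
print(mean_vec([h_ice_water, h_boiled_water, h_in_water]))  # [2.0, 3.0]
```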
It should be noted that the first representation mean value and the second representation mean value determined in this embodiment facilitate the subsequent calculation of the word representation of the target word; the following three calculation manners for the word representation of the target word are all developed on the basis of the first representation mean value and the second representation mean value, as described in the following embodiments.
The first calculation mode of the word representation of the target word is as follows: and taking the mean value of the word representation of each participle in the first target word as the word representation of the first target word, and taking the mean value of the word representation of the initial word having an association relation with the second target word as the word representation of the second target word.
Specifically, the calculating the word representation of the target word based on the preset representation generation rule includes:
calculating a word representation of the first target word based on the first representation mean; and/or
Calculating a word representation of the second target word based on the second representation mean.
In practical applications, the word representation of the first target word may be obtained based on the first representation mean, and the word representation of the second target word may be obtained based on the second representation mean. Specifically, the word representation of the first target word may be calculated with reference to formula (1):

g(w) = (1/|S_m(w)|) · Σ_{w'∈S_m(w)} E(w')   (1)

wherein g(w) represents the word representation value of the first target word w, S_m(w) is the set of words whose form is similar to the first target word w, E(w') is the word representation of a word w' in S_m(w), and the average over S_m(w) is the first representation mean value. The word representation of the second target word may be calculated in the same manner as formula (1), and will not be described in detail herein.
According to the text processing method provided by the embodiment of the specification, the word representation of the target word is calculated through the word representation mean value of the participle in the target word or the word representation mean value associated with the target word, so that the word representation of the target word can be accurately determined, and the word representation of the target word is conveniently added into the initial word list for updating.
The second calculation mode of the word representation of the target word is as follows: in addition to using the representation mean value of the participles, or of the initial words associated with the target word, a weight value can be determined for each participle, or for each associated initial word, based on the semantic information of the target word. This adjusts the proportion with which the different words contribute to the mean calculation, so that the word representation of the target word is calculated more accurately.
Specifically, the calculating the word representation of the target word based on the preset representation generation rule includes:
determining a first weight value of each participle in a first target word based on semantic information, and calculating a word representation of the first target word based on the first representation mean value and the first weight value of each participle; and/or
Determining a second weight value of the initial word having an association relation with a second target word based on semantic information, and calculating a word representation of the second target word based on the second representation mean value and the second weight value.
In practical application, a first weight value of each participle in a first target word can be determined through the semantic information of the first target word. For example, if the first target word is "soda water", it can be known from the semantic information that the participle "water" carries a higher semantic weight than the participle "steam"; the first weight value of each participle is therefore determined based on the initial word list, and the word representation of the first target word is calculated from the first representation mean value and the first weight values. The specific calculation may refer to formula (2):

g(w) = Σ_{w'∈S_m(w)} α(w') · E(w'),  α(w') = softmax_{w'∈S_m(w)}(E(w') · W)   (2)

wherein g(w) represents the word representation value of the first target word w, S_m(w) is the set of words whose form is similar to the first target word w, E(w') is the word representation of a word w' in S_m(w), α(w') is the weight of w', and W is a trainable parameter matrix.
In practical application, a second weight value of each initial word having an association relationship with the second target word may be determined through the semantic information of the second target word. For example, if the second target word is "water" and the associated initial words are "ice water", "boiled water" and "in water", it can be known from the semantic information that "ice water" and "boiled water" are semantically closer to "water" than "in water", and therefore receive higher weight values. The second weight value of each associated initial word may thus be determined based on the initial word list, and the word representation of the second target word is calculated from the second representation mean value and the second weight values; the specific calculation may refer to formula (2) above, and is not repeated here.
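A hedged sketch of the semantic weighting described above: here the weight of each related word comes from a dot product with a trainable parameter vector (standing in for the parameter matrix W) followed by a softmax; the method's exact scoring function may differ.

```python
import math

def attn_representation(vectors, w_param):
    """Softmax-weighted combination of word vectors, scored against w_param."""
    scores = [sum(a * b for a, b in zip(v, w_param)) for v in vectors]
    m = max(scores)                        # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(vectors[0])
    rep = [sum(weights[i] * vectors[i][d] for i in range(len(vectors)))
           for d in range(dim)]
    return rep, weights

vecs = [[1.0, 0.0], [0.0, 1.0]]
rep, weights = attn_representation(vecs, [0.0, 0.0])
print(weights)  # equal weights [0.5, 0.5] when all scores are equal
print(rep)      # [0.5, 0.5]
```

With a zero parameter vector all scores are equal, so the result degenerates to the plain mean of the first calculation mode; a trained parameter shifts weight toward the semantically dominant words.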
The text processing method provided in the embodiment of the present specification uses a word representation calculation method based on an attention mechanism: the weight of each participle, or of each initial word associated with the target word, is determined, and the word representation of the target word is then calculated from the weights and the representation mean. Adding semantic weight information in this way makes the expressed meaning of the target word more accurate, and thus makes the calculation of its word representation more effective.
The third calculation mode of the word representation of the target word is as follows: when calculating the weight values of the participles of the target word, or of the initial words associated with the target word, according to the semantic information, position information is integrated into the attention mechanism. The weights of the different words are thus adjusted while the position information indicates where each word occurs, making the calculation of the word representation of the target word more accurate.
Specifically, the calculating the word representation of the target word based on the preset representation generation rule includes:
determining position information of each participle in a first target word, determining a first weight value of each participle in the first target word based on the position information and semantic information, and calculating a word representation of the first target word based on the first representation mean value and the first weight value of each participle; and/or
Determining position information of a second target word in an initial word having an association relationship with the second target word, determining a second weight value of the initial word having the association relationship with the second target word based on the position information and semantic information, and calculating a word representation of the second target word based on the second representation mean value and the second weight value.
In practical application, after the first target word is split, the position information of each participle can be determined, and the first weight value of each participle in the first target word is determined based on its position information and semantic information. For example, for the first target word "soda water", split into the participles "steam" and "water", the participle "steam" is located at the head of the first target word and the participle "water" at its tail. Therefore, in determining the first weight value of the participle "steam", its position information is taken into account, and the first weight value is determined based on both the position information and the semantic information. The word representation of the first target word is then calculated from the first weight values and the first representation mean value; the specific calculation may refer to formula (3):

g(w) = Σ_{w'∈S_m(w)} α(w') · E(w'),  α(w') = softmax_{w'∈S_m(w)}((E(w') + I(w')) · W)   (3)

wherein g(w) represents the word representation value of the first target word w, S_m(w) is the set of words whose form is similar to the first target word w, E(w') is the word representation of a word w' in S_m(w), W is a trainable parameter matrix, and I(w') is the position information of the participle w'.
In practical application, after the initial words having an association relationship with the second target word are determined, the position of the second target word within each initial word may be determined. For example, if the second target word is "water" and the associated initial words are "ice water", "boiled water" and "in water", the position of the second target word is the last position in "ice water" and "boiled water" and the first position in "in water". The second weight value of each associated initial word is therefore determined based on the position information and the semantic information, and the word representation of the second target word is calculated from the second representation mean value and the second weight values; the specific calculation may refer to formula (3) above, and is not repeated here.
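A sketch of the position-aware variant: a position embedding is added to each word vector before scoring. The additive fusion and the toy position embeddings are assumptions; the description only states that position information is integrated into the attention mechanism.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def position_aware_rep(vectors, pos_embs, w_param):
    """Attention over (word vector + position embedding) scores."""
    fused = [[a + b for a, b in zip(v, p)] for v, p in zip(vectors, pos_embs)]
    scores = [sum(a * b for a, b in zip(f, w_param)) for f in fused]
    w = softmax(scores)
    dim = len(vectors[0])
    return [sum(w[i] * vectors[i][d] for i in range(len(vectors)))
            for d in range(dim)]

vecs = [[2.0, 0.0], [0.0, 2.0]]
pos = [[0.0, 0.0], [0.0, 0.0]]  # toy head / tail position embeddings
print(position_aware_rep(vecs, pos, [0.0, 0.0]))  # [1.0, 1.0]
```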
It should be emphasized that, compared with the first two ways, which consider only the mean value or only a semantically weighted mean value, the third way is more accurate. Under different task requirements, any one of the ways of calculating the word representation may be selected, which is not limited in this embodiment.
In the text processing method provided by the embodiment of the specification, in the process of determining the word representation of the target word, position information of the words is integrated into the attention mechanism, so that the weight value of each word can be expressed more accurately, resulting in a more accurate word representation calculation.
The preset representation generation rule in the above embodiments may, in practical application, be a word representation generator, and in order to continuously improve the accuracy of the calculated word representations of target words, the parameters of the preset representation generation rule may also be adjusted; specifically, after the word representation of the target word is calculated based on the preset representation generation rule, the method further includes:
determining a word representation loss value of the target word;
determining a candidate word representation of the initial text based on the word representations of the candidate words, and determining an initial word representation of the initial text based on the word representations of the initial words;
determining a text representation loss value of the initial text based on the candidate word representations of the initial text and the initial word representations of the initial text;
and adjusting parameters of the preset representation generation rule based on the word representation loss value and the text representation loss value.
The word representation loss value of the target word may be understood as a loss value calculated for the word representations obtained by the calculation formulas in the above embodiments.
For example, if the initial text is "soda water contains a large amount of carbon dioxide" and its candidate words are "steam", "water", "middle", "containing", "large amount", "dioxide" and "carbon", then the candidate word representation of the initial text consists of the word representations of all the candidate words.
The initial word representation of the initial text may be understood as the word representation of the initial text composed of its initial words. Following the above example, the initial words are "soda water", "middle", "containing", "large amount", "di", "oxide" and "carbon", and the initial word representation of the initial text consists of the word representations of all the initial words.
Specifically, the loss value of the preset representation generation rule is calculated in this embodiment from two parts: on the one hand, the word representation loss value of each target word is calculated; on the other hand, the loss value between the split-and-recombined sentence and the original sentence is calculated as the text representation loss value of the original sentence (the text representation loss value of the initial text). The two loss values are added together as the overall loss value of the preset representation generation rule, and the corresponding parameters of the preset representation generation rule are adjusted based on this overall loss value.
In practical applications, in order to reduce the above two loss values, the word representation loss value of the target words may be calculated with reference to the training loss of the pre-training model when the generated word representations are used; the text representation loss value of the initial text may be calculated with reference to the distance between the candidate word representations and the initial word representations of the initial text, for example L_text = ||H(candidate) − H(initial)||². Further, the overall loss value of the preset representation generation rule is the sum of the two: L = L_word + L_text.
in the text processing method provided in the embodiment of the present specification, the loss value of the target word and the loss values of the characteristic values of the text in which the target word is located and the characteristic values of the initial word are calculated, so as to determine the overall loss value of the preset characteristic generation rule, thereby adjusting parameters in the preset characteristic generation rule and improving the calculation accuracy of the preset characteristic generation rule.
Based on the determined word representations of the target words, the target words and their word representations can be updated into the initial word list, so that the initial word list is adapted to the second application scenario; specifically, after the word representation of the target word is calculated based on the preset representation generation rule, the method further includes:
determining alternative words from the initial word list based on scene requirements, and forming a target word list based on the alternative words and/or the target words; or
And adding the target words to the initial word list to obtain a target word list.
In practical application, there are two ways to determine the target word list. In the first way, alternative words meeting the scenario requirement are determined from the initial word list according to the scenario requirement, and the target word list is formed from these alternative words and/or the target words determined in the above embodiments; a target word list determined in this way meets the scenario requirement and facilitates the subsequent training of a model suitable for the application scenario. In the second way, all the target words determined in the above embodiments are added to the initial word list to obtain the target word list; a target word list formed in this way is applicable to both the first application scenario and the second application scenario, that is, based on this target word list, text processing can be realized in the first application scenario as well as in the second application scenario. It should be noted that the manner of determining the target word list in the text processing method provided by this embodiment is not limited to the above two ways.
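The two ways of forming the target word list can be sketched as dictionary operations over (word → representation) entries; the entries below are toy values.

```python
def extend_vocab(initial_vocab, target_entries):
    """Way two: add all target words (with representations) to the initial list."""
    merged = dict(initial_vocab)
    merged.update(target_entries)
    return merged

def scene_vocab(initial_vocab, keep, target_entries):
    """Way one: keep only scene-relevant alternative words, plus the targets."""
    out = {w: v for w, v in initial_vocab.items() if w in keep}
    out.update(target_entries)
    return out

init = {"carbon": [0.1], "middle": [0.2]}
targets = {"dioxide": [0.3]}
print(sorted(extend_vocab(init, targets)))             # all three entries
print(sorted(scene_vocab(init, {"carbon"}, targets)))  # ['carbon', 'dioxide']
```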
From the introduction of the text processing method in the foregoing embodiments, it can be seen that the word representation migration process is the process of forming a target word list: it processes the initial word list so that word lists associated with different application scenarios can be formed, adapting the model to more application scenarios. Referring to fig. 2, fig. 2 shows a schematic diagram of implementing word representation migration by the text processing method provided in the present specification.
The first row in fig. 2 shows the original word representations (the word representations of the initial words): the original initial text is "soda water contains a large amount of carbon dioxide", the initial text is divided into initial words, and the word representations of the initial words can be denoted as E(soda water), E(middle), E(containing), E(large amount), E(di), E(oxide) and E(carbon). The last row in fig. 2 shows the word representation migration process: the word representations corresponding to the candidate words obtained after the initial text is randomly split or merged are G(steam), G(water), E(middle), E(containing), E(large amount), G(dioxide) and E(carbon). The representations G(steam), G(water) and G(dioxide) do not exist among the word representations of the initial words, so in the migration process the word representations of the target words "steam", "water" and "dioxide" need to be calculated, which can be done with the word representation calculation methods for target words described above. The target words and their word representations are then input to the pre-training model as training data, so that the trained text processing model can be adapted to another application scenario.
On this basis, once the word representations of the target words have been calculated based on the preset representation generation rule, the model can be trained. Specifically, after the word representation of the target word is calculated based on the preset representation generation rule, the method further includes:
determining a word representation of the initial word;
training a target text processing model based on the initial words, the word representations of the initial words, the target words, and the word representations of the target words.
In practical applications, the pre-training model may determine a word representation of each initial word in the initial text; the manner of determining the target words and the manner of calculating their word representations described above also apply to this embodiment and are not repeated here.
According to the text processing method provided by the embodiment of the specification, the text processing model is trained on the determined target words, the word representations of the target words, the initial words, and the word representations of the initial words. This allows the model to flexibly select a specific word list in an actual project scenario, improving the effect of the model in that scenario.
Further, said training a target text processing model based on said initial words, said word representations of said initial words, said target words, and said word representations of said target words includes:
training an initial text processing model based on the initial words and the word representations of the initial words;
and inputting the target words and the word representations of the target words into the initial text processing model, and training the initial text processing model to obtain the target text processing model.
In practical application, during pre-training, an initial text processing model can be trained based on the initial words and the word representations of the initial words. This initial text processing model may only be suitable for a small number of application scenarios, and may suffer from problems such as over-segmentation of domain-specific proper nouns or sentences, so its text processing effect is poor and not general. With the above approach, the required target word lists can be extracted according to different project scenarios and data requirements, yielding a project-specific model, which is then trained on the project-scenario data to obtain a target text processing model that can actually be delivered.
According to the text processing method provided by the embodiment of the specification, the initial text processing model is trained first, and the target text processing model is then obtained from it, so that the target text processing model is suitable for subsequent text processing in different project scenarios.
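The two training stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the specification's actual implementation: the model is reduced to its embedding table, the representations are hypothetical 2-dimensional vectors, and gradient-based training is elided.

```python
# Toy two-stage set-up: stage 1 stands in for pre-training on the initial
# vocabulary; stage 2 injects the target words with their pre-computed
# representations and would continue training on project data.
def train_initial_model(initial_words, initial_reprs):
    # Hypothetical stand-in for pre-training: the "model" just stores the table.
    return {"vocab": dict(zip(initial_words, initial_reprs))}

def extend_and_finetune(model, target_words, target_reprs):
    # Inject the target words without discarding the learned initial entries;
    # fine-tuning on project-scenario data would happen after this point.
    model["vocab"].update(zip(target_words, target_reprs))
    return model

model = train_initial_model(["di", "oxide"], [[0.5, 0.25], [0.5, 0.75]])
model = extend_and_finetune(model, ["dioxide"], [[0.5, 0.5]])
print(sorted(model["vocab"]))  # ['di', 'dioxide', 'oxide']
```

The key design point is that stage 2 only adds entries, so everything learned for the initial vocabulary is preserved.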
The following will further describe the text processing method by taking an application of the text processing method provided in this specification to a text processing model as an example with reference to fig. 3. Fig. 3 is a schematic diagram illustrating a model application of a text processing method according to an embodiment of the present disclosure.
Fig. 3 can be divided into two major parts, the first part is a pre-training part, and the second part is a model application part.
First, a pre-training model is trained on pre-training data, where the pre-training data can be understood as a text sentence. The text sentence is split into at least two initial words by an intermediate layer of the pre-training model, such as the initial words "reporter", "worker", "motorcycle", "author", and "vehicle" in fig. 3. The initial words are input into a representation generator of the pre-training part, which randomly splits or merges the text sentence again; from the processed text sentence, at least two candidate words can be determined. The candidate words are matched against the initial words, and the target words can be determined as "person" and "motorcycle". The representation generator then calculates the corresponding word representations for the target words "person" and "motorcycle". Meanwhile, the target words can be added to the word list of the model application part to obtain a target word list suitable for a project scenario. The target word list can be obtained in two ways: one is to screen, from the initial word list of the pre-training part, words meeting the requirements of the project scenario, these words forming the target word list together with the target words; the other is to merge the target words into all the words in the initial word list to obtain the target word list.
The downstream data in fig. 3 may be understood as project data in different application scenarios, and the downstream model may be understood as a text processing model obtained by adjusting the pre-training model to the target word list so as to suit the project scenario. In practical application, after the downstream model acquires the target word list, it can perform text processing on the downstream data, and the representation generator is used to calculate the word representations of the target words in the target word list, enabling the model's subsequent processing of the downstream data.
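The fig. 3 flow can be pieced together as a short sketch. All words and vectors here are hypothetical, and the target word's representation is derived as the element-wise mean of the representations of its parts, standing in for the representation generator.

```python
def mean_vec(vs):
    """Element-wise mean of equal-length vectors."""
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

# Pre-training part: initial word list with hypothetical representations.
initial = {"motor": [0.5, 0.25], "cycle": [0.5, 0.75], "reporter": [0.25, 0.5]}

# Representation generator: re-segmentation proposes candidate words;
# candidates absent from the initial word list become target words.
candidates = ["motorcycle", "reporter"]
targets = [w for w in candidates if w not in initial]  # ['motorcycle']

# The generator computes representations for the target words.
target_reprs = {"motorcycle": mean_vec([initial["motor"], initial["cycle"]])}

# Model application part: the downstream model receives the target word list.
target_vocab = {**initial, **target_reprs}
print(targets, target_vocab["motorcycle"])  # ['motorcycle'] [0.5, 0.5]
```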
In the text processing method provided by the embodiment of the specification, the target words suitable for the project scenario are determined, their word representations are calculated by the representation generator, and a text processing model suitable for the project scenario is trained and applied to the text processing of project data, thereby improving the accuracy and efficiency of text processing.
Corresponding to the above method embodiment, this specification further provides a text processing apparatus embodiment, and fig. 4 shows a schematic structural diagram of a text processing apparatus provided in an embodiment of this specification. As shown in fig. 4, the apparatus includes:
a determining module 402 configured to determine initial words of the initial text in the initial word list;
an obtaining module 404, configured to process the initial text based on a preset processing rule, and obtain candidate words of the processed initial text;
a comparison module 406 configured to compare the initial word with the candidate word, and use a candidate word that does not match the initial word as a target word;
a calculation module 408 configured to calculate a word representation of the target word based on a preset representation generation rule.
Optionally, the apparatus further comprises:
a training module configured to determine a word representation of the initial word;
training a target text processing model based on the initial words, the word representations of the initial words, the target words, and the word representations of the target words.
Optionally, the apparatus further comprises:
an obtaining module configured to determine alternative words from the initial word list based on scenario requirements, and form a target word list based on the alternative words and/or the target words; or
adding the target words to the initial word list to obtain the target word list.
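The two ways of forming the target word list can be sketched as follows; the helper name, the word lists, and the scene filter are illustrative assumptions.

```python
def build_target_vocab(initial_vocab, target_words, scene_filter=None):
    """Two strategies, per the description above: with a scene_filter, screen
    the initial word list for words the project scenario needs and combine
    them with the target words; without one, merge the target words into the
    whole initial word list."""
    kept = [w for w in initial_vocab if scene_filter is None or scene_filter(w)]
    return kept + [w for w in target_words if w not in kept]

# Screening strategy: drop "reporter", then add the target words.
vocab = build_target_vocab(
    ["reporter", "worker", "vehicle"], ["person", "motorcycle"],
    scene_filter=lambda w: w != "reporter",
)
print(vocab)  # ['worker', 'vehicle', 'person', 'motorcycle']

# Merging strategy: append the target words to the full initial word list.
print(build_target_vocab(["reporter", "worker"], ["person"]))
# ['reporter', 'worker', 'person']
```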
Optionally, the training module is further configured to:
training an initial text processing model based on the initial words and word characteristics of the initial words;
and inputting the target words and the word characteristics of the target words into the initial text processing model, and training the initial text processing model to obtain the target text processing model.
Optionally, the obtaining module 404 is further configured to:
splitting and/or merging the words in the initial text based on a preset processing rule to obtain candidate words of the initial text.
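A minimal sketch of such a preset processing rule is shown below. The merge/split choices here are random, as the description suggests, and the exact policy is an assumption for illustration; note that any mix of keeping, splitting, and merging preserves the character sequence of the text.

```python
import random

def candidate_words(initial_words, seed=0):
    """Hypothetical re-segmentation rule: randomly keep a word, split it into
    its characters, or merge it with the next word, yielding candidate words
    to compare against the initial segmentation."""
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(initial_words):
        choice = rng.choice(["keep", "split", "merge"])
        if choice == "merge" and i + 1 < len(initial_words):
            out.append(initial_words[i] + initial_words[i + 1])
            i += 2
        elif choice == "split" and len(initial_words[i]) > 1:
            out.extend(initial_words[i])  # one candidate per character
            i += 1
        else:
            out.append(initial_words[i])
            i += 1
    return out

cands = candidate_words(["di", "oxide", "carbon"])
# Candidates absent from the initial segmentation become target words.
targets = [w for w in cands if w not in {"di", "oxide", "carbon"}]
```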
Optionally, the apparatus further comprises:
a representation mean determining module configured to determine a word representation of each segment in the first target word and determine a first representation mean based on the word representation of each segment; and/or
determine a word representation of the initial word having an association relationship with the second target word, and determine a second representation mean based on that word representation.
Optionally, the calculation module 408 is further configured to:
calculating a word representation of the first target word based on the first representation mean; and/or
calculating a word representation of the second target word based on the second representation mean.
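The two representation means can be sketched as follows; the words and vectors are hypothetical, and a text's segment/word association is hard-coded for illustration.

```python
def mean_vec(vs):
    """Element-wise mean of equal-length vectors."""
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

# First target word (e.g. "dioxide", obtained by merging): the first
# representation mean is taken over the representations of its segments.
segment_reprs = [[0.5, 0.25], [0.5, 0.75]]  # "di", "oxide" (hypothetical)
first_mean = mean_vec(segment_reprs)        # [0.5, 0.5]

# Second target word (e.g. "water", obtained by splitting): the second
# representation mean is taken over the associated initial word(s).
associated_reprs = [[0.25, 0.75]]           # "soda water" (hypothetical)
second_mean = mean_vec(associated_reprs)    # [0.25, 0.75]
```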
Optionally, the calculation module 408 is further configured to:
determining a first weight value of each segment in the first target word based on semantic information, and calculating a word representation of the first target word based on the first representation mean and the first weight value of each segment; and/or
determining a second weight value of the initial word having an association relationship with the second target word based on semantic information, and calculating a word representation of the second target word based on the second representation mean and the second weight value.
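The semantically weighted combination can be sketched as follows. The weight values are hypothetical (standing in for semantic information) and are assumed normalised to sum to 1.

```python
def weighted_repr(reprs, weights):
    """Weighted combination of representations; the weights (e.g. derived
    from semantic information) are assumed to sum to 1."""
    dim = len(reprs[0])
    return [sum(w * v[i] for w, v in zip(weights, reprs)) for i in range(dim)]

# Hypothetical: "oxide" carries more of "dioxide"'s meaning than "di",
# so it receives the larger semantic weight.
segs = [[0.5, 0.25], [0.5, 0.75]]          # representations of "di", "oxide"
print(weighted_repr(segs, [0.25, 0.75]))   # [0.5, 0.625]
```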
Optionally, the calculation module 408 is further configured to:
determining position information of each segment in the first target word, determining a first weight value of each segment in the first target word based on the position information and semantic information, and calculating a word representation of the first target word based on the first representation mean and the first weight value of each segment; and/or
determining position information of the second target word within the initial word having an association relationship with the second target word, determining a second weight value of that initial word based on the position information and semantic information, and calculating a word representation of the second target word based on the second representation mean and the second weight value.
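One way position and semantic information could be combined into weights is sketched below. The multiplicative form and the positional decay factor are assumptions for illustration; the specification does not fix a particular formula.

```python
def position_semantic_weights(sem_scores, positions, decay=0.5):
    """Hypothetical weighting: each segment's semantic score is scaled by a
    positional factor decay**position, then the weights are normalised
    to sum to 1."""
    raw = [s * decay ** p for s, p in zip(sem_scores, positions)]
    total = sum(raw)
    return [r / total for r in raw]

# Equal semantic scores: the earlier segment ends up weighted more heavily.
w = position_semantic_weights([1.0, 1.0], positions=[0, 1])
```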
Optionally, the apparatus further comprises:
an adjustment module configured to determine a word representation loss value of the target word;
determining a candidate word representation of the initial text based on the word representations of the candidate words, determining an initial word representation of the initial text based on the word representations of the initial words;
determining a text representation loss value of the initial text based on the candidate word representations of the initial text and the initial word representations of the initial text;
and adjusting parameters of the preset representation generation rule based on the word representation loss value and the text representation loss value.
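The adjustment step can be sketched as follows. Squared error and mean pooling of word representations into a text representation are illustrative assumptions, not necessarily the specification's exact loss.

```python
def l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def text_repr(word_reprs):
    # A text representation, taken here as the mean of its word representations.
    n = len(word_reprs)
    return [sum(v[i] for v in word_reprs) / n for i in range(len(word_reprs[0]))]

# Hypothetical vectors.
target_pred = [0.5, 0.5]   # generator's representation of a target word
target_ref  = [0.5, 0.75]  # reference representation of the same word
word_loss = l2(target_pred, target_ref)               # word representation loss

cand_text = text_repr([[0.5, 0.5], [0.25, 0.25]])     # candidate-word view
init_text = text_repr([[0.5, 0.75], [0.25, 0.0]])     # initial-word view
text_loss = l2(cand_text, init_text)                  # text representation loss

# The combined loss drives the adjustment of the generation rule's parameters.
total = word_loss + text_loss
print(total)  # 0.0625
```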
The text processing apparatus provided by the embodiment of the specification compares the initial words of the initial text with the candidate words of the processed initial text, screens out the candidate words that do not match the initial words as target words, and then calculates the word representations of these target words based on the preset representation generation rule, making it convenient to subsequently adapt the word representations of the target words to the requirements of different application scenarios.
The above is a schematic scheme of a text processing apparatus of the present embodiment. It should be noted that the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the text processing apparatus can be referred to the description of the technical solution of the text processing method.
FIG. 5 illustrates a block diagram of a computing device 500 provided in accordance with one embodiment of the present description. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes access device 540, access device 540 enabling computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 540 may include one or more of any type of network interface, e.g., a Network Interface Card (NIC), wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 5 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
Wherein the processor 520 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the text processing method described above.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the text processing method.
An embodiment of the present specification further provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor implement the steps of the above-mentioned text processing method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text processing method.
An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the text processing method.
The above is an illustrative scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the computer program can be referred to the description of the technical solution of the text processing method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals, in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the embodiments. The specification is limited only by the claims and their full scope and equivalents.