Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; these embodiments are provided so that those skilled in the art will be able to make and use the present disclosure without departing from its spirit and scope.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination," depending on the context.
First, the terms involved in one or more embodiments of the present specification are explained.
Pre-trained model: a model obtained by training on large-scale unlabeled data.
Word list (vocabulary): when a natural-language neural network model processes a sentence, the words are first mapped into mathematical representations, and the word list is used as the lookup table in this process.
In practical application, the word list of a pre-trained model is usually fixed: it cannot be adjusted according to the task data of subsequent project scenarios, and it over-segments proper nouns or domain-specific sentences. As a result, the pre-trained model can hardly understand sentences in actual project scenarios and cannot process them effectively, which restricts its performance. For such scenarios, specific optimization can be performed on the subsequent task data, for example, retraining a new word list on other data or training a new word list on the downstream data. However, this solution consumes resources in every project scenario for tedious word list retraining, lacks generality, and completely discards the initial word list, which harms model performance.
Based on this, the text processing method provided in the embodiments of the present specification implements pluggable word list migration, addressing the problem of subword segmentation differences in language generation models, and also provides a new application process for pre-trained models. When a word in a new word list does not appear in the original word list, a word representation generator can be used to generate a mathematical representation for the new word. The word representation generator comprehensively analyzes the mathematical representations of old words whose form is similar to the new word and calculates the mathematical representation of the new word through a neural network model, so that the desired word list can be used in the downstream project scenario. When the word representation generator is trained, new words are generated by randomly splitting and combining the old words of the pre-trained model, and the parameters of the representation generator are updated by jointly computing the training loss of the pre-trained model and a loss that pulls the representation of the split-and-recombined sentence close to that of the original sentence. This method allows the model to flexibly select a specific word list in the actual project scenario, thereby improving the effect of the model in that scenario.
It should be noted that the pre-training model involved in the text processing method provided in the embodiments of the present specification may be understood as any text processing model, such as a translation model. The following embodiments are described in detail by taking a text translation model as an example.
In the present specification, a text processing method is provided, and the present specification relates to a text processing apparatus, a computing device, a computer-readable storage medium, and a computer program, which are described in detail one by one in the following embodiments.
In order to obtain more model training data and make a text processing model suitable for different project scenarios, the text processing method provided by this embodiment provides a way to generate word representations for target words: corresponding word representations can be generated for different target words by using a preset representation generation rule, so as to build a word list suitable for a project scenario, and this word list can also be applied to the text processing model. Referring to fig. 1, fig. 1 is a flowchart illustrating a text processing method according to an embodiment of the present disclosure, which includes the following steps.
Step 102: an initial word of the initial text is determined in an initial vocabulary.
Here, the initial word list may be understood as the word list built from the initial training data of the pre-training model, and the initial text may be understood as a text sentence to be processed.
It should be noted that, this embodiment is exemplified by detailed processing of the initial text in the initial training data in the pre-training model, and the same processing may be performed on a plurality of texts at the same time according to the following manner, which is not limited in this embodiment.
In practical application, the server may divide an initial text, where the initial text may be understood as a text sentence in a first application scenario of the pre-training model, and determine the initial words corresponding to the divided text sentence in the initial word list of the pre-training model. The division of the text sentence into initial words may be based on a text processing rule of the first application scenario; this embodiment does not limit the specific dividing manner.
It should be noted that the first application scenario may be understood as one application scenario to which the pre-training model can be applied. For example, if the pre-training model is a text translation model and the first application scenario is a scientific translation scenario, the word list of the text translation model in this scenario is biased toward scientific vocabulary.
For example, the first application scenario is a scientific text processing scenario, and the selected initial text is "soda water contains a large amount of carbon dioxide". The pre-training model on the server can divide this initial text, and the resulting initial words are "soda water", "middle", "containing", "large amount", "di", "oxide" and "carbon", respectively.
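As a sketch of the segmentation step above: the following greedy longest-match routine is one plausible way a word list could drive the division of a sentence into initial words; the toy word list, the `max_len` bound, and the single-character fallback are illustrative assumptions, not the method's prescribed rule.

```python
def segment(text: str, vocab: set, max_len: int = 4) -> list:
    """Greedy longest-match segmentation against a word list."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                words.append(piece)
                i += length
                break
    return words

# Toy example: "ab" is in the word list, so it is kept whole.
print(segment("abcd", {"ab"}))  # ['ab', 'c', 'd']
```

Any segmentation rule with the same interface (text in, list of words out) could be substituted here without changing the later steps.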
Step 104: and processing the initial text based on a preset processing rule to obtain candidate words of the processed initial text.
The preset processing rule may be understood as a rule for dividing and/or merging the words of the initial text; different application scenarios may be adapted to different processing rules, which is not limited in this embodiment.
In practical application, in order to adapt to different project application scenarios, the initial text may be processed again according to the preset processing rule to obtain candidate words of the processed initial text. This makes it convenient to subsequently screen out, from the candidate words, the words suitable for another project application scenario, and thus to build a new word list suitable for that scenario.
Specifically, the processing the initial text based on the preset processing rule to obtain the candidate words of the processed initial text includes:
splitting and/or merging the words in the initial text based on a preset processing rule to obtain candidate words of the initial text.
In practical application, in order to adapt to a second application scenario of the pre-training model, which may be, for example, a daily-life translation scenario, the words in the initial text are split and/or merged based on a preset processing rule of the second application scenario. Splitting may be understood as dividing a word into its components: for example, the word "soda water" may be split into "steam" and "water". Merging may be understood as combining two adjacent words: for example, the two words "di" and "oxide" are merged into "dioxide". The words obtained after splitting and/or merging are used as the candidate words of the initial text.
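A minimal sketch of the split-and/or-merge processing described above, assuming the rule is given as two hypothetical lookup tables (`split_map`, `merge_map`); the actual preset processing rule may take any form.

```python
def apply_rules(words, split_map, merge_map):
    """Split words per split_map, then merge adjacent pairs per merge_map."""
    # First split any word the rule table says to split.
    out = []
    for w in words:
        out.extend(split_map.get(w, [w]))
    # Then merge adjacent pairs listed in the merge table.
    merged, i = [], 0
    while i < len(out):
        if i + 1 < len(out) and (out[i], out[i + 1]) in merge_map:
            merged.append(merge_map[(out[i], out[i + 1])])
            i += 2
        else:
            merged.append(out[i])
            i += 1
    return merged

words = ["soda water", "di", "oxide", "carbon"]
split_map = {"soda water": ["steam", "water"]}
merge_map = {("di", "oxide"): "dioxide"}
print(apply_rules(words, split_map, merge_map))
# ['steam', 'water', 'dioxide', 'carbon']
```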
In the text processing method provided by the embodiment of the specification, the initial text is processed again through the preset processing rule to obtain the candidate words, so that words different from the initial words can subsequently be screened out from the candidate words for the second application scenario, and the word representations of those words can then be obtained.
Step 106: and comparing the initial word with the candidate word, and taking the candidate word which is not matched with the initial word as a target word.
In order to determine which words required for the initial text in the second application scenario differ from those of the first application scenario, in practical applications the initial words of the first application scenario are compared with the candidate words of the second application scenario. The candidate words that do not match any initial word are screened out as target words, and the word representations of the target words are then determined, so that they can subsequently be added to the initial word list, making the initial word list suitable for both the first application scenario and the second application scenario.
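The comparison step above can be sketched as a simple set difference; the word lists below reuse the running example and are illustrative only.

```python
def target_words(initial_words, candidate_words):
    """Candidate words with no match in the initial word list become targets."""
    initial = set(initial_words)
    return [w for w in candidate_words if w not in initial]

initial = ["soda water", "middle", "containing", "large amount",
           "di", "oxide", "carbon"]
candidates = ["steam", "water", "middle", "containing", "large amount",
              "dioxide", "carbon"]
print(target_words(initial, candidates))  # ['steam', 'water', 'dioxide']
```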
Step 108: and calculating the word characteristics of the target words based on a preset characteristic generation rule.
The preset representation generation rule may be understood as a rule for calculating the word representation of a word; it may be a representation generator or a representation calculation formula, which is not limited in this embodiment. The following embodiments provide three ways of calculating the word representation of a target word, but the embodiments of the present disclosure are not limited to these three calculation methods.
In order to determine an accurate word representation of a target word, the target words can be divided into two cases: a target word that can be further split is called a first target word, and a target word that cannot be split is called a second target word. For these two cases, the word representations of the different target words can be calculated from the corresponding representation mean values. Specifically, before the calculating the word representation of the target word based on the preset representation generation rule, the method further includes:
determining a word representation of each participle in a first target word, and determining a first representation mean value based on the word representation of each participle; and/or
Determining a word representation of the initial word having an association relationship with the second target word, and determining a second representation mean based on the word representation of the initial word having an association relationship with the second target word.
The first target word can be understood as a target word which can be further split into participles, for example, a target word of "soda water", and two participles, namely "steam" and "water", can be split.
The second target word can be understood as a target word which cannot be continuously split into participles, for example, the target word of "water", and the participles cannot be continuously split.
In specific implementation, the pre-training model can determine the word representation of each participle in the first target word, and then determine the average of the word representations of the participles as the first representation mean value; in practical application, the word representation of the first target word can be determined based on this average. Because the second target word cannot be split into participles, the pre-training model instead determines the word representations of the initial words that have an association relationship with the second target word, and then determines the second representation mean value based on those word representations.
For example, if the first target word is "soda water", the pre-training model may determine the word representations of the participles "steam" and "water", which may be denoted as h'(steam) and h'(water), respectively, so that the first representation mean value h'(soda water) is the average of h'(steam) and h'(water). If the second target word is "water", it may be determined from the initial word list that the initial words associated with "water" include "ice water", "boiled water" and "in water"; their word representations may be denoted as h'(ice water), h'(boiled water) and h'(in water), so that the second representation mean value h'(water) is the average of h'(ice water), h'(boiled water) and h'(in water).
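The two representation means can be sketched as plain vector averages; the two-dimensional toy vectors below stand in for real word representations from the pre-trained model's embedding table.

```python
def mean_vec(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

# First mean: average over the participles of a splittable target word.
h_steam, h_water = [1.0, 3.0], [3.0, 5.0]
print(mean_vec([h_steam, h_water]))  # [2.0, 4.0]

# Second mean: average over initial words related to an unsplittable target.
h_ice_water, h_boiled_water, h_in_water = [0.0, 6.0], [2.0, 2.0], [4.0, 1.0]
print(mean_vec([h_ice_water, h_boiled_water, h_in_water]))  # [2.0, 3.0]
```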
It should be noted that the first representation mean value and the second representation mean value determined in this embodiment facilitate the subsequent calculation of the word representation of the target word; the following three calculation manners for the word representation of the target word are all developed on the basis of the first representation mean value and the second representation mean value, as described in the following embodiments.
The first calculation mode of the word representation of the target word is as follows: and taking the mean value of the word representation of each participle in the first target word as the word representation of the first target word, and taking the mean value of the word representation of the initial word having an association relation with the second target word as the word representation of the second target word.
Specifically, the calculating the word representation of the target word based on the preset representation generation rule includes:
calculating a word representation of the first target word based on the first representation mean; and/or
Calculating a word representation of the second target word based on the second representation mean.
In practical applications, the word representation of the first target word may be obtained based on the first representation mean, and the word representation of the second target word may be obtained based on the second representation mean. Specifically, the word representation of the first target word may be calculated with reference to formula (1):

g(w) = (1/|S_m(w)|) · Σ_{w'∈S_m(w)} E(w')   (1)

wherein g(w) represents the word representation value of the first target word w, S_m(w) is the set of words whose form is similar to the first target word w, E(w') is the word representation of a word w' in S_m(w), and the average over S_m(w) is the first representation mean value. The word representation of the second target word may be calculated in the same manner as formula (1), and will not be described in detail herein.
According to the text processing method provided by the embodiment of the specification, the word representation of the target word is calculated through the word representation mean value of the participle in the target word or the word representation mean value associated with the target word, so that the word representation of the target word can be accurately determined, and the word representation of the target word is conveniently added into the initial word list for updating.
The second calculation mode of the word representation of the target word is as follows: in addition to using the representation mean value of the participles, or of the initial words associated with the target word, a weight value can be determined for each participle, or for each associated initial word, based on the semantic information of the target word. This adjusts the proportion with which the different words contribute to the mean calculation, so that the word representation of the target word is calculated more accurately.
Specifically, the calculating the word representation of the target word based on the preset representation generation rule includes:
determining a first weight value of each participle in a first target word based on semantic information, and calculating a word representation of the first target word based on the first representation mean value and the first weight value of each participle; and/or
Determining a second weight value of the initial word having an association relation with a second target word based on semantic information, and calculating a word representation of the second target word based on the second representation mean value and the second weight value.
In practical application, a first weight value of each participle in a first target word can be determined through the semantic information of the first target word. For example, if the first target word is "soda water", it can be known from the semantic information that the participle "water" carries a higher semantic weight than the participle "steam"; the first weight value of each participle is therefore determined based on the initial word list, and the word representation of the first target word is calculated from the first representation mean value and the first weight values. The specific calculation may refer to formula (2):

g(w) = Σ_{w'∈S_m(w)} α(w') · E(w'),  α(w') = softmax_{w'∈S_m(w)}(E(w') · W)   (2)

wherein g(w) represents the word representation value of the first target word w, S_m(w) is the set of words whose form is similar to the first target word w, E(w') is the word representation of a word w' in S_m(w), α(w') is the weight of w', and W is a trainable parameter matrix.
In practical application, a second weight value of each initial word having an association relationship with the second target word may be determined through the semantic information of the second target word. For example, if the second target word is "water" and the associated initial words are "ice water", "boiled water" and "in water", it can be known from the semantic information that "ice water" and "boiled water" are semantically closer to "water" than "in water", and therefore receive higher weight values. The second weight value of each associated initial word may thus be determined based on the initial word list, and the word representation of the second target word is calculated from the second representation mean value and the second weight values; the specific calculation may refer to formula (2) above, and is not repeated here.
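A hedged sketch of the semantic weighting described above: here the weight of each related word comes from a dot product with a trainable parameter vector (standing in for the parameter matrix W) followed by a softmax; the method's exact scoring function may differ.

```python
import math

def attn_representation(vectors, w_param):
    """Softmax-weighted combination of word vectors, scored against w_param."""
    scores = [sum(a * b for a, b in zip(v, w_param)) for v in vectors]
    m = max(scores)                        # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(vectors[0])
    rep = [sum(weights[i] * vectors[i][d] for i in range(len(vectors)))
           for d in range(dim)]
    return rep, weights

vecs = [[1.0, 0.0], [0.0, 1.0]]
rep, weights = attn_representation(vecs, [0.0, 0.0])
print(weights)  # equal weights [0.5, 0.5] when all scores are equal
print(rep)      # [0.5, 0.5]
```

With a zero parameter vector all scores are equal, so the result degenerates to the plain mean of the first calculation mode; a trained parameter shifts weight toward the semantically dominant words.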
The text processing method provided in the embodiment of the present specification uses a word representation calculation method based on an attention mechanism: the weight of each participle, or of each initial word associated with the target word, is determined, and the word representation of the target word is then calculated from the weights and the representation mean. Adding semantic weight information in this way makes the expressed meaning of the target word more accurate, and thus makes the calculation of its word representation more effective.
The third calculation mode of the word representation of the target word is as follows: when calculating the weight values of the participles of the target word, or of the initial words associated with the target word, according to the semantic information, position information is integrated into the attention mechanism. The weights of the different words are thus adjusted while the position information indicates where each word occurs, making the calculation of the word representation of the target word more accurate.
Specifically, the calculating the word representation of the target word based on the preset representation generation rule includes:
determining position information of each participle in a first target word, determining a first weight value of each participle in the first target word based on the position information and semantic information, and calculating a word representation of the first target word based on the first representation mean value and the first weight value of each participle; and/or
Determining position information of a second target word in an initial word having an association relationship with the second target word, determining a second weight value of the initial word having the association relationship with the second target word based on the position information and semantic information, and calculating a word representation of the second target word based on the second representation mean value and the second weight value.
In practical application, after the first target word is split, the position information of each participle can be determined, and the first weight value of each participle in the first target word is determined based on its position information and semantic information. For example, for the first target word "soda water", split into the participles "steam" and "water", the participle "steam" is located at the head of the first target word and the participle "water" at its tail. Therefore, in determining the first weight value of the participle "steam", its position information is taken into account, and the first weight value is determined based on both the position information and the semantic information. The word representation of the first target word is then calculated from the first weight values and the first representation mean value; the specific calculation may refer to formula (3):

g(w) = Σ_{w'∈S_m(w)} α(w') · E(w'),  α(w') = softmax_{w'∈S_m(w)}((E(w') + I(w')) · W)   (3)

wherein g(w) represents the word representation value of the first target word w, S_m(w) is the set of words whose form is similar to the first target word w, E(w') is the word representation of a word w' in S_m(w), W is a trainable parameter matrix, and I(w') is the position information of the participle w'.
In practical application, after the initial words having an association relationship with the second target word are determined, the position of the second target word within each initial word may be determined. For example, if the second target word is "water" and the associated initial words are "ice water", "boiled water" and "in water", the position of the second target word is the last position in "ice water" and "boiled water" and the first position in "in water". The second weight value of each associated initial word is therefore determined based on the position information and the semantic information, and the word representation of the second target word is calculated from the second representation mean value and the second weight values; the specific calculation may refer to formula (3) above, and is not repeated here.
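A sketch of the position-aware variant: a position embedding is added to each word vector before scoring. The additive fusion and the toy position embeddings are assumptions; the description only states that position information is integrated into the attention mechanism.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def position_aware_rep(vectors, pos_embs, w_param):
    """Attention over (word vector + position embedding) scores."""
    fused = [[a + b for a, b in zip(v, p)] for v, p in zip(vectors, pos_embs)]
    scores = [sum(a * b for a, b in zip(f, w_param)) for f in fused]
    w = softmax(scores)
    dim = len(vectors[0])
    return [sum(w[i] * vectors[i][d] for i in range(len(vectors)))
            for d in range(dim)]

vecs = [[2.0, 0.0], [0.0, 2.0]]
pos = [[0.0, 0.0], [0.0, 0.0]]  # toy head / tail position embeddings
print(position_aware_rep(vecs, pos, [0.0, 0.0]))  # [1.0, 1.0]
```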
It should be emphasized that, compared with the first two ways, which consider only the mean value or only a semantically weighted mean value, the third way is more accurate. Under different task requirements, any one of the ways of calculating the word representation may be selected, which is not limited in this embodiment.
In the text processing method provided by the embodiment of the specification, in the process of determining the word representation of the target word, position information of the words is integrated into the attention mechanism, so that the weight value of each word can be expressed more accurately, resulting in a more accurate word representation calculation.
The preset representation generation rule in the above embodiments may, in practical application, be a word representation generator, and in order to continuously improve the accuracy of the calculated word representations of target words, the parameters of the preset representation generation rule may also be adjusted; specifically, after the word representation of the target word is calculated based on the preset representation generation rule, the method further includes:
determining a word representation loss value of the target word;
determining a candidate word representation of the initial text based on the word representations of the candidate words, and determining an initial word representation of the initial text based on the word representations of the initial words;
determining a text representation loss value of the initial text based on the candidate word representations of the initial text and the initial word representations of the initial text;
and adjusting parameters of the preset representation generation rule based on the word representation loss value and the text representation loss value.
The word representation loss value of the target word may be understood as a loss value calculated for the word representations obtained by the calculation formulas in the above embodiments.
For example, if the initial text is "soda water contains a large amount of carbon dioxide" and its candidate words are "steam", "water", "middle", "containing", "large amount", "dioxide" and "carbon", then the candidate word representation of the initial text consists of the word representations of all the candidate words.
The initial word representation of the initial text may be understood as the word representation of the initial text composed of its initial words. Following the above example, the initial words are "soda water", "middle", "containing", "large amount", "di", "oxide" and "carbon", and the initial word representation of the initial text consists of the word representations of all the initial words.
Specifically, the loss value of the preset representation generation rule is calculated in this embodiment from two parts: on the one hand, the word representation loss value of each target word is calculated; on the other hand, the loss value between the split-and-recombined sentence and the original sentence is calculated as the text representation loss value of the original sentence (the text representation loss value of the initial text). The two loss values are added together as the overall loss value of the preset representation generation rule, and the corresponding parameters of the preset representation generation rule are adjusted based on this overall loss value.
In practical applications, in order to reduce the above two loss values, the word representation loss value of the target words may be calculated with reference to the training loss of the pre-training model when the generated word representations are used; the text representation loss value of the initial text may be calculated with reference to the distance between the candidate word representations and the initial word representations of the initial text, for example L_text = ||H(candidate) − H(initial)||². Further, the overall loss value of the preset representation generation rule is the sum of the two: L = L_word + L_text.
in the text processing method provided in the embodiment of the present specification, the loss value of the target word and the loss values of the characteristic values of the text in which the target word is located and the characteristic values of the initial word are calculated, so as to determine the overall loss value of the preset characteristic generation rule, thereby adjusting parameters in the preset characteristic generation rule and improving the calculation accuracy of the preset characteristic generation rule.
Based on the determined word representations of the target words, the target words and their word representations can be updated into the initial word list, so that the initial word list is adapted to the second application scenario; specifically, after the word representation of the target word is calculated based on the preset representation generation rule, the method further includes:
determining alternative words from the initial word list based on scene requirements, and forming a target word list based on the alternative words and/or the target words; or
And adding the target words to the initial word list to obtain a target word list.
In practical application, there are two ways to determine the target word list. In the first way, alternative words meeting the scenario requirement are determined from the initial word list according to the scenario requirement, and the target word list is formed from these alternative words and/or the target words determined in the above embodiments; a target word list determined in this way meets the scenario requirement and facilitates the subsequent training of a model suitable for the application scenario. In the second way, all the target words determined in the above embodiments are added to the initial word list to obtain the target word list; a target word list formed in this way is applicable to both the first application scenario and the second application scenario, that is, based on this target word list, text processing can be realized in the first application scenario as well as in the second application scenario. It should be noted that the manner of determining the target word list in the text processing method provided by this embodiment is not limited to the above two ways.
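The two ways of forming the target word list can be sketched as dictionary operations over (word → representation) entries; the entries below are toy values.

```python
def extend_vocab(initial_vocab, target_entries):
    """Way two: add all target words (with representations) to the initial list."""
    merged = dict(initial_vocab)
    merged.update(target_entries)
    return merged

def scene_vocab(initial_vocab, keep, target_entries):
    """Way one: keep only scene-relevant alternative words, plus the targets."""
    out = {w: v for w, v in initial_vocab.items() if w in keep}
    out.update(target_entries)
    return out

init = {"carbon": [0.1], "middle": [0.2]}
targets = {"dioxide": [0.3]}
print(sorted(extend_vocab(init, targets)))             # all three entries
print(sorted(scene_vocab(init, {"carbon"}, targets)))  # ['carbon', 'dioxide']
```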
From the introduction of the text processing method in the foregoing embodiments, it can be seen that the word representation migration process is the process of forming a target word list: it processes the initial word list so that word lists associated with different application scenarios can be formed, adapting the model to more application scenarios. Referring to fig. 2, fig. 2 shows a schematic diagram of implementing word representation migration by the text processing method provided in the present specification.
The first row in fig. 2 shows the original word representations (the word representations of the initial words): the original initial text is "soda water contains a large amount of carbon dioxide", the initial text is divided into initial words, and the word representations of the initial words can be denoted as E(soda water), E(middle), E(containing), E(large amount), E(di), E(oxide) and E(carbon). The last row in fig. 2 shows the word representation migration process: the word representations corresponding to the candidate words obtained after the initial text is randomly split or merged are G(steam), G(water), E(middle), E(containing), E(large amount), G(dioxide) and E(carbon). The representations G(steam), G(water) and G(dioxide) do not exist among the word representations of the initial words, so in the migration process the word representations of the target words "steam", "water" and "dioxide" need to be calculated, which can be done with the word representation calculation methods for target words described above. The target words and their word representations are then input to the pre-training model as training data, so that the trained text processing model can be adapted to another application scenario.
On this basis, once the word representations of the target words have been calculated based on the preset representation generation rule, the model can be trained. Specifically, after the word representation of the target word is calculated based on the preset representation generation rule, the method further includes:
determining a word representation of the initial word;
training a target text processing model based on the initial words, the word representations of the initial words, the target words, and the word representations of the target words.
In practical applications, the pre-training model may determine a word representation of each initial word in the initial text; the manner of determining the target words and the manner of calculating their word representations described above also apply to this embodiment and are not repeated here.
According to the text processing method provided by the embodiment of the specification, the text processing model is trained on the determined target words, the word representations of the target words, the initial words, and the word representations of the initial words. This allows the model to flexibly select a specific word list in an actual project scenario, improving the effect of the model in that scenario.
Further, said training a target text processing model based on said initial words, said word representations of said initial words, said target words, and said word representations of said target words includes:
training an initial text processing model based on the initial words and the word representations of the initial words;
and inputting the target words and the word representations of the target words into the initial text processing model, and training the initial text processing model to obtain the target text processing model.
In practical application, during pre-training, an initial text processing model can be trained based on the initial words and the word representations of the initial words. This initial text processing model may only be suitable for a small number of application scenarios, and may suffer from problems such as over-segmentation of domain-specific proper nouns or sentences, so its text processing effect is poor and not general. With the above approach, the required target word lists can be extracted according to different project scenarios and data requirements, yielding a project-specific model, which is then trained on the project-scenario data to obtain a target text processing model that can actually be delivered.
According to the text processing method provided by the embodiment of the specification, the initial text processing model is trained first, and the target text processing model is then obtained from it, so that the target text processing model is suitable for subsequent text processing in different project scenarios.
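The two training stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the specification's actual implementation: the model is reduced to its embedding table, the representations are hypothetical 2-dimensional vectors, and gradient-based training is elided.

```python
# Toy two-stage set-up: stage 1 stands in for pre-training on the initial
# vocabulary; stage 2 injects the target words with their pre-computed
# representations and would continue training on project data.
def train_initial_model(initial_words, initial_reprs):
    # Hypothetical stand-in for pre-training: the "model" just stores the table.
    return {"vocab": dict(zip(initial_words, initial_reprs))}

def extend_and_finetune(model, target_words, target_reprs):
    # Inject the target words without discarding the learned initial entries;
    # fine-tuning on project-scenario data would happen after this point.
    model["vocab"].update(zip(target_words, target_reprs))
    return model

model = train_initial_model(["di", "oxide"], [[0.5, 0.25], [0.5, 0.75]])
model = extend_and_finetune(model, ["dioxide"], [[0.5, 0.5]])
print(sorted(model["vocab"]))  # ['di', 'dioxide', 'oxide']
```

The key design point is that stage 2 only adds entries, so everything learned for the initial vocabulary is preserved.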
The following will further describe the text processing method by taking an application of the text processing method provided in this specification to a text processing model as an example with reference to fig. 3. Fig. 3 is a schematic diagram illustrating a model application of a text processing method according to an embodiment of the present disclosure.
Fig. 3 can be divided into two major parts, the first part is a pre-training part, and the second part is a model application part.
First, a pre-training model is trained on pre-training data, where the pre-training data can be understood as a text sentence. The text sentence is split into at least two initial words by an intermediate layer of the pre-training model, such as the initial words "reporter", "worker", "motorcycle", "author", and "vehicle" in fig. 3. The initial words are input into a representation generator of the pre-training part, which randomly splits or merges the text sentence again; from the processed text sentence, at least two candidate words can be determined. The candidate words are matched against the initial words, and the target words can be determined as "person" and "motorcycle". The representation generator then calculates the corresponding word representations for the target words "person" and "motorcycle". Meanwhile, the target words can be added to the word list of the model application part to obtain a target word list suitable for a project scenario. The target word list can be obtained in two ways: one is to screen, from the initial word list of the pre-training part, words meeting the requirements of the project scenario, these words forming the target word list together with the target words; the other is to merge the target words into all the words in the initial word list to obtain the target word list.
The downstream data in fig. 3 may be understood as project data in different application scenarios, and the downstream model may be understood as a text processing model obtained by adjusting the pre-training model to the target word list so as to suit the project scenario. In practical application, after the downstream model acquires the target word list, it can perform text processing on the downstream data, and the representation generator is used to calculate the word representations of the target words in the target word list, enabling the model's subsequent processing of the downstream data.
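The fig. 3 flow can be pieced together as a short sketch. All words and vectors here are hypothetical, and the target word's representation is derived as the element-wise mean of the representations of its parts, standing in for the representation generator.

```python
def mean_vec(vs):
    """Element-wise mean of equal-length vectors."""
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

# Pre-training part: initial word list with hypothetical representations.
initial = {"motor": [0.5, 0.25], "cycle": [0.5, 0.75], "reporter": [0.25, 0.5]}

# Representation generator: re-segmentation proposes candidate words;
# candidates absent from the initial word list become target words.
candidates = ["motorcycle", "reporter"]
targets = [w for w in candidates if w not in initial]  # ['motorcycle']

# The generator computes representations for the target words.
target_reprs = {"motorcycle": mean_vec([initial["motor"], initial["cycle"]])}

# Model application part: the downstream model receives the target word list.
target_vocab = {**initial, **target_reprs}
print(targets, target_vocab["motorcycle"])  # ['motorcycle'] [0.5, 0.5]
```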
In the text processing method provided by the embodiment of the specification, the target words suitable for the project scenario are determined, their word representations are calculated by the representation generator, and a text processing model suitable for the project scenario is trained and applied to the text processing of project data, thereby improving the accuracy and efficiency of text processing.
Corresponding to the above method embodiment, this specification further provides a text processing apparatus embodiment, and fig. 4 shows a schematic structural diagram of a text processing apparatus provided in an embodiment of this specification. As shown in fig. 4, the apparatus includes:
a determining module 402 configured to determine initial words of the initial text in the initial word list;
an obtaining module 404, configured to process the initial text based on a preset processing rule, and obtain candidate words of the processed initial text;
a comparison module 406 configured to compare the initial word with the candidate word, and use a candidate word that does not match the initial word as a target word;
a calculation module 408 configured to calculate a word representation of the target word based on a preset representation generation rule.
Optionally, the apparatus further comprises:
a training module configured to determine a word representation of the initial word;
training a target text processing model based on the initial words, the word representations of the initial words, the target words, and the word representations of the target words.
Optionally, the apparatus further comprises:
an obtaining module configured to determine alternative words from the initial word list based on scenario requirements, and form a target word list based on the alternative words and/or the target words; or
adding the target words to the initial word list to obtain the target word list.
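The two ways of forming the target word list can be sketched as follows; the helper name, the word lists, and the scene filter are illustrative assumptions.

```python
def build_target_vocab(initial_vocab, target_words, scene_filter=None):
    """Two strategies, per the description above: with a scene_filter, screen
    the initial word list for words the project scenario needs and combine
    them with the target words; without one, merge the target words into the
    whole initial word list."""
    kept = [w for w in initial_vocab if scene_filter is None or scene_filter(w)]
    return kept + [w for w in target_words if w not in kept]

# Screening strategy: drop "reporter", then add the target words.
vocab = build_target_vocab(
    ["reporter", "worker", "vehicle"], ["person", "motorcycle"],
    scene_filter=lambda w: w != "reporter",
)
print(vocab)  # ['worker', 'vehicle', 'person', 'motorcycle']

# Merging strategy: append the target words to the full initial word list.
print(build_target_vocab(["reporter", "worker"], ["person"]))
# ['reporter', 'worker', 'person']
```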
Optionally, the training module is further configured to:
training an initial text processing model based on the initial words and word characteristics of the initial words;
and inputting the target words and the word characteristics of the target words into the initial text processing model, and training the initial text processing model to obtain the target text processing model.
Optionally, the obtaining module 404 is further configured to:
splitting and/or merging the words in the initial text based on a preset processing rule to obtain candidate words of the initial text.
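A minimal sketch of such a preset processing rule is shown below. The merge/split choices here are random, as the description suggests, and the exact policy is an assumption for illustration; note that any mix of keeping, splitting, and merging preserves the character sequence of the text.

```python
import random

def candidate_words(initial_words, seed=0):
    """Hypothetical re-segmentation rule: randomly keep a word, split it into
    its characters, or merge it with the next word, yielding candidate words
    to compare against the initial segmentation."""
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(initial_words):
        choice = rng.choice(["keep", "split", "merge"])
        if choice == "merge" and i + 1 < len(initial_words):
            out.append(initial_words[i] + initial_words[i + 1])
            i += 2
        elif choice == "split" and len(initial_words[i]) > 1:
            out.extend(initial_words[i])  # one candidate per character
            i += 1
        else:
            out.append(initial_words[i])
            i += 1
    return out

cands = candidate_words(["di", "oxide", "carbon"])
# Candidates absent from the initial segmentation become target words.
targets = [w for w in cands if w not in {"di", "oxide", "carbon"}]
```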
Optionally, the apparatus further comprises:
a representation mean determining module configured to determine a word representation of each segment in the first target word and determine a first representation mean based on the word representation of each segment; and/or
determine a word representation of the initial word having an association relationship with the second target word, and determine a second representation mean based on that word representation.
Optionally, the calculation module 408 is further configured to:
calculating a word representation of the first target word based on the first representation mean; and/or
calculating a word representation of the second target word based on the second representation mean.
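The two representation means can be sketched as follows; the words and vectors are hypothetical, and a text's segment/word association is hard-coded for illustration.

```python
def mean_vec(vs):
    """Element-wise mean of equal-length vectors."""
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

# First target word (e.g. "dioxide", obtained by merging): the first
# representation mean is taken over the representations of its segments.
segment_reprs = [[0.5, 0.25], [0.5, 0.75]]  # "di", "oxide" (hypothetical)
first_mean = mean_vec(segment_reprs)        # [0.5, 0.5]

# Second target word (e.g. "water", obtained by splitting): the second
# representation mean is taken over the associated initial word(s).
associated_reprs = [[0.25, 0.75]]           # "soda water" (hypothetical)
second_mean = mean_vec(associated_reprs)    # [0.25, 0.75]
```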
Optionally, the calculation module 408 is further configured to:
determining a first weight value of each segment in the first target word based on semantic information, and calculating a word representation of the first target word based on the first representation mean and the first weight value of each segment; and/or
determining a second weight value of the initial word having an association relationship with the second target word based on semantic information, and calculating a word representation of the second target word based on the second representation mean and the second weight value.
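The semantically weighted combination can be sketched as follows. The weight values are hypothetical (standing in for semantic information) and are assumed normalised to sum to 1.

```python
def weighted_repr(reprs, weights):
    """Weighted combination of representations; the weights (e.g. derived
    from semantic information) are assumed to sum to 1."""
    dim = len(reprs[0])
    return [sum(w * v[i] for w, v in zip(weights, reprs)) for i in range(dim)]

# Hypothetical: "oxide" carries more of "dioxide"'s meaning than "di",
# so it receives the larger semantic weight.
segs = [[0.5, 0.25], [0.5, 0.75]]          # representations of "di", "oxide"
print(weighted_repr(segs, [0.25, 0.75]))   # [0.5, 0.625]
```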
Optionally, the calculation module 408 is further configured to:
determining position information of each segment in the first target word, determining a first weight value of each segment in the first target word based on the position information and semantic information, and calculating a word representation of the first target word based on the first representation mean and the first weight value of each segment; and/or
determining position information of the second target word within the initial word having an association relationship with the second target word, determining a second weight value of that initial word based on the position information and semantic information, and calculating a word representation of the second target word based on the second representation mean and the second weight value.
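One way position and semantic information could be combined into weights is sketched below. The multiplicative form and the positional decay factor are assumptions for illustration; the specification does not fix a particular formula.

```python
def position_semantic_weights(sem_scores, positions, decay=0.5):
    """Hypothetical weighting: each segment's semantic score is scaled by a
    positional factor decay**position, then the weights are normalised
    to sum to 1."""
    raw = [s * decay ** p for s, p in zip(sem_scores, positions)]
    total = sum(raw)
    return [r / total for r in raw]

# Equal semantic scores: the earlier segment ends up weighted more heavily.
w = position_semantic_weights([1.0, 1.0], positions=[0, 1])
```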
Optionally, the apparatus further comprises:
an adjustment module configured to determine a word representation loss value of the target word;
determining a candidate word representation of the initial text based on the word representations of the candidate words, determining an initial word representation of the initial text based on the word representations of the initial words;
determining a text representation loss value of the initial text based on the candidate word representations of the initial text and the initial word representations of the initial text;
and adjusting parameters of the preset representation generation rule based on the word representation loss value and the text representation loss value.
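The adjustment step can be sketched as follows. Squared error and mean pooling of word representations into a text representation are illustrative assumptions, not necessarily the specification's exact loss.

```python
def l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def text_repr(word_reprs):
    # A text representation, taken here as the mean of its word representations.
    n = len(word_reprs)
    return [sum(v[i] for v in word_reprs) / n for i in range(len(word_reprs[0]))]

# Hypothetical vectors.
target_pred = [0.5, 0.5]   # generator's representation of a target word
target_ref  = [0.5, 0.75]  # reference representation of the same word
word_loss = l2(target_pred, target_ref)               # word representation loss

cand_text = text_repr([[0.5, 0.5], [0.25, 0.25]])     # candidate-word view
init_text = text_repr([[0.5, 0.75], [0.25, 0.0]])     # initial-word view
text_loss = l2(cand_text, init_text)                  # text representation loss

# The combined loss drives the adjustment of the generation rule's parameters.
total = word_loss + text_loss
print(total)  # 0.0625
```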
The text processing apparatus provided by the embodiment of the specification compares the initial words of the initial text with the candidate words of the processed initial text, screens out the candidate words that do not match the initial words as target words, and then calculates the word representations of these target words based on the preset representation generation rule, making it convenient to subsequently adapt the word representations of the target words to the requirements of different application scenarios.
The above is a schematic scheme of a text processing apparatus of the present embodiment. It should be noted that the technical solution of the text processing apparatus and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the text processing apparatus can be referred to the description of the technical solution of the text processing method.
FIG. 5 illustrates a block diagram of a computing device 500 provided in accordance with one embodiment of the present description. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes access device 540, access device 540 enabling computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 540 may include one or more of any type of network interface, e.g., a Network Interface Card (NIC), wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 500, as well as other components not shown in FIG. 5, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 5 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
Wherein the processor 520 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the text processing method described above.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the text processing method.
An embodiment of the present specification further provides a computer-readable storage medium storing computer-executable instructions, which when executed by a processor implement the steps of the above-mentioned text processing method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the text processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the text processing method.
An embodiment of the present specification further provides a computer program, wherein when the computer program is executed in a computer, the computer is caused to execute the steps of the text processing method.
The above is an illustrative scheme of a computer program of the present embodiment. It should be noted that the technical solution of the computer program and the technical solution of the text processing method belong to the same concept, and details that are not described in detail in the technical solution of the computer program can be referred to the description of the technical solution of the text processing method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals, in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the embodiments. The specification is limited only by the claims and their full scope and equivalents.