CN113033211A


Info

Publication number: CN113033211A
Application number: CN202110321272.1A
Authority: CN
Inventors: 刘晓艺; 陈静萍; 张煜
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2021-06-25
Anticipated expiration: 2041-03-25
Also published as: CN113033211B

Abstract

The present application discloses a data processing method and apparatus. The method includes: obtaining target data, where the target data includes a plurality of long sentences; splitting the long sentences into short sentences to obtain a plurality of sub-sentences; obtaining the semantic logical relationships between the sub-sentences; and combining the sub-sentences according to the semantic logical relationships to obtain a plurality of new long sentences.

Description

Data processing method and device

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method and apparatus.

Background

In the field of natural language processing, data enhancement is required because insufficient data causes many problems.

However, currently adopted enhancement schemes such as vocabulary replacement, reverse translation, and text surface conversion produce enhanced data that differs little from the original data, so the reliability of data enhancement is low.

Disclosure of Invention

In view of the above, the present application provides a data processing method and apparatus, as follows:

a method of data processing, comprising:

obtaining target data, wherein the target data comprises a plurality of long sentences;

splitting the long sentences into short sentences to obtain a plurality of sub-sentences;

obtaining semantic logical relations between the sub-sentences;

and combining the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences.

Preferably, in the method, before combining the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences, the method further includes:

performing data enhancement on the sub-sentences so that the number of the sub-sentences is increased.

The method preferably further comprises, after obtaining a plurality of new long sentences:

performing data enhancement on one or more sub-sentences in the new long sentences, so that the number of new long sentences is increased.

Preferably, in the method, performing data enhancement on the sub-sentences includes any one or more of the following:

performing character replacement on the sub-sentence;

performing reverse translation on the sub-sentence;

and performing text conversion on the sub-sentence.

Preferably, the method further includes:

grouping the sub-sentences according to the semantic logical relationship to obtain a plurality of sentence groups;

wherein combining the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences comprises:

according to the semantic logical relationship, arbitrarily selecting one sub-sentence from each of any plurality of sentence groups and combining the selected sub-sentences, so as to obtain a plurality of new long sentences.

In the method, preferably, the semantics of the sub-sentences in the same sentence group satisfy the semantic similarity condition.

Preferably, in the method, arbitrarily selecting one sub-sentence from each of any plurality of sentence groups according to the semantic logical relationship and combining them to obtain a plurality of new long sentences includes:

obtaining one or more sentence sets according to the semantic logical relationship, wherein each sentence set comprises a plurality of sentence groups, and the sub-sentences contained in the sentence groups of the same sentence set have the semantic logical relationship;

and, for each sentence set, arbitrarily selecting one sub-sentence from each sentence group contained in the sentence set and combining them to obtain a plurality of new long sentences, wherein each new long sentence comprises a plurality of sub-sentences, and the sub-sentences contained in the new long sentence have semantic logical relationships.

Preferably, in the above method, obtaining one or more sentence sets according to the semantic logical relationship includes:

establishing a graph corresponding to the sentence groups according to the semantic logical relationship, wherein the graph comprises a plurality of nodes, at least one node corresponds to a sentence group, at least two nodes are connected with each other, and a connection line between nodes represents the semantic logical relationship between the sentence groups corresponding to the connected nodes;

and acquiring one or more sentence sets according to the nodes and the connection lines in the graph, wherein each sentence set comprises a plurality of sentence groups, and connection lines exist in the graph between the nodes corresponding to the sentence groups of the same sentence set.

In the above method, preferably, before the short sentence splitting is performed on the long sentence to obtain a plurality of sub-sentences, the method further includes:

performing data enhancement on the long sentences in the target data, so that the number of the long sentences in the target data is increased.

A data processing apparatus comprising:

the data acquisition unit is used for acquiring target data, wherein the target data comprises a plurality of long sentences, and the long sentences are sentences with complete semantics;

the sentence splitting unit is used for splitting the long sentences into short sentences to obtain a plurality of sub-sentences;

a logic obtaining unit, configured to obtain semantic logic relationships between the sub-sentences;

and the sentence combination unit is used for combining the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences.

According to the above scheme, after the long sentences are split, the sub-sentences are recombined according to the semantic logical relationships between them, so that a plurality of new sentences different from the long sentences before splitting are obtained and the number of long sentences is increased. Schemes such as vocabulary replacement, reverse translation, and text surface conversion yield data that differs little from the original data; by contrast, the present application substantially increases the number of sentences, which greatly increases the data volume, realizes data enhancement, and improves the reliability of data enhancement.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a data processing method according to an embodiment of the present application;

Fig. 2 to Fig. 6 are further flowcharts of a data processing method according to an embodiment of the present application;

Fig. 7 is a partial flowchart of a data processing method according to an embodiment of the present application;

Fig. 8 to Fig. 12 are diagrams illustrating the processing of sentences in an embodiment of the present application;

Fig. 13 is a schematic structural diagram of a data processing apparatus according to a second embodiment of the present application;

Fig. 14 is another schematic structural diagram of a data processing apparatus according to a second embodiment of the present application;

Fig. 15 is a schematic structural diagram of an electronic device according to a third embodiment of the present application;

Fig. 16 is an exemplary diagram of a prior art scheme employing example cross-boosting;

Fig. 17 to Fig. 20 are diagrams illustrating examples of training data enhancement applied to deep learning according to embodiments of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, a flowchart of an implementation of a data processing method provided in an embodiment of the present application is shown, where the method may be applied to an electronic device capable of performing data processing, such as a computer or a server. The technical scheme in the embodiment is mainly used for improving the reliability of data enhancement.

Specifically, the method in this embodiment may include the following steps:

Step 101: obtain target data.

The target data may be training data for performing natural language processing, such as sample data of model training. In this embodiment, the target data to be enhanced may be obtained through web page crawling or other manners.

Specifically, the target data includes a plurality of long sentences. For example, the target data may be a text paragraph containing a plurality of long sentences. A long sentence is understood to be a sentence with complete semantics, for example a sentence ending with a period or a question mark, such as "do you want to send repair if the mobile phone is turned on" and "how to repair if I break the mobile phone?".

Step 102: split the long sentences into short sentences to obtain a plurality of sub-sentences.

In this embodiment, the long sentence may be split into short sentences in a preset splitting manner.

In one implementation, the long sentence may be split into short sentences according to the positions of the punctuation marks in the long sentence. For example, the long sentence is split at clause-connecting punctuation such as commas, colons, and pause marks. Splitting "I broke the mobile phone, how to repair?" in this way yields the sub-sentences "I broke the mobile phone" and "how to repair?".

In another implementation, the long sentence may be split into short sentences according to the positions of the conjunctions in the long sentence. For example, the long sentence is split at connecting words that express transition or juxtaposition, such as "but", "and", and "then". Splitting "I dropped the mobile phone, and then the mobile phone cannot be turned on" in this way yields the sub-sentences "I dropped the mobile phone" and "the mobile phone cannot be turned on".
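The two splitting manners above can be sketched together as follows. This is a minimal illustration and not the splitting rule of the application: the punctuation set, the connective list, and the function name `split_long_sentence` are all assumptions chosen for the example.

```python
import re

# Clause-connecting punctuation (ASCII and full-width forms); an illustrative choice.
CLAUSE_PUNCT = r"[,:;，：；、]"
# Connecting words for transition or juxtaposition; hypothetical and non-exhaustive.
CONNECTIVES = ("and then", "but", "then", "and")


def split_long_sentence(sentence: str) -> list[str]:
    """Split a long sentence into sub-sentences at punctuation and connectives."""
    conn = "|".join(re.escape(word) for word in CONNECTIVES)
    # Either punctuation optionally followed by a connective,
    # or a connective standing alone between spaces.
    pattern = rf"\s*{CLAUSE_PUNCT}\s*(?:(?:{conn})\b\s*)?|\s+(?:{conn})\b\s+"
    parts = re.split(pattern, sentence.strip())
    return [part.strip() for part in parts if part and part.strip()]
```

A purely lexical split like this will also cut at connectives that join words rather than clauses (for example "salt and pepper"), so a practical system would likely need syntactic or semantic checks on top of it.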

Step 103: semantic logical relationships between the sub-sentences are obtained.

The semantic logical relationship refers to a semantic association between sub-sentences, such as a synonymy relationship, a conditional relationship, a causal relationship, a parallel relationship, or a generalization relationship. A synonymy relationship means that the events semantically expressed by two sub-sentences are consistent. A conditional relationship means that the event semantically expressed by one sub-sentence is the basis on which the event semantically expressed by another sub-sentence occurs. A causal relationship means that the events semantically expressed by two sub-sentences stand in a relation of cause and effect. A parallel relationship means that the events semantically expressed by two sub-sentences are parallel to each other. A generalization relationship means that the event semantically expressed by one sub-sentence is a superordinate concept or generalized description of the event semantically expressed by another sub-sentence.

Specifically, in this embodiment, semantic analysis may be performed on the sub-sentences to obtain the event semantically expressed by each sub-sentence, and thereby the semantic logical relationships between the sub-sentences. The semantic logical relationship can be understood as a discourse relationship.

It should be noted that, among all the sub-sentences split from the target data, some sub-sentences have a semantic logical relationship with one another and some do not. For example, there is a causal relationship between the sub-sentence "I dropped the mobile phone" and the sub-sentence "the mobile phone cannot be turned on", whereas the sub-sentence "I broke the mobile phone" and the sub-sentence "the cold noodles at home are super delicious" have no semantic logical relationship.
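One way to hold the result of step 103 is a table that maps ordered pairs of sub-sentences to a relationship type, with a missing entry meaning "no semantic logical relationship". The sketch below assumes the relationships have already been determined; the `Relation` enum, the `RELATIONS` table, and `relation_between` are hypothetical names, and the table is hand-labelled data standing in for the semantic analysis, which the application does not specify in detail.

```python
from enum import Enum
from typing import Optional


class Relation(Enum):
    """The five relationship types named in the description."""
    SYNONYMY = "synonymy"
    CONDITIONAL = "conditional"
    CAUSAL = "causal"
    PARALLEL = "parallel"
    GENERALIZATION = "generalization"


# Hand-labelled, directed relations (hypothetical data standing in for semantic analysis).
RELATIONS = {
    ("I dropped the mobile phone", "the mobile phone cannot be turned on"): Relation.CAUSAL,
}


def relation_between(first: str, second: str) -> Optional[Relation]:
    """Return the labelled relation from `first` to `second`, or None if unrelated."""
    return RELATIONS.get((first, second))
```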

Step 104: combine the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences.

In this embodiment, the sub-sentences having the semantic logical relationship may be combined in a specific combination manner, so that a plurality of new long sentences may be combined, the number of the long sentences may be increased, and data enhancement may be achieved.

As can be seen from the foregoing solution, in the data processing method provided in the first embodiment of the present application, after a long sentence is split, the sub-sentences are recombined according to the semantic logical relationships between them, so that a plurality of new sentences different from the long sentence before splitting are obtained and the number of long sentences is increased. Schemes such as vocabulary replacement, reverse translation, and text surface conversion yield data that differs little from the original data; by contrast, this embodiment substantially increases the number of sentences, which greatly increases the data volume, realizes data enhancement, and improves the reliability of data enhancement.

In one implementation, before the sub-sentences are combined according to the semantic logical relationship in step 104 to obtain a plurality of new long sentences, the method in this embodiment may further include the following step, as shown in fig. 2:

Step 105: perform data enhancement on the sub-sentences so that the number of sub-sentences is increased.

Specifically, in this embodiment, the sub-sentences may be subjected to data enhancement processing one or more times in multiple ways, so as to increase the number of the sub-sentences, and on this basis, after the sub-sentences are combined according to the semantic logical relationship between the sub-sentences, the number of the obtained new long sentences may be further increased.

For example, the data enhancement processing on the sub-sentences in the present embodiment may include any one or more of the following:

performing character replacement on the sub-sentences, performing reverse translation on the sub-sentences, and performing text conversion on the sub-sentences.

Performing character replacement on a sub-sentence means replacing keywords in the sub-sentence, such as the subject or object, with words of the same or similar meaning; the resulting new sub-sentence is semantically the same as or similar to the original sub-sentence, but its wording is changed;

performing reverse translation on a sub-sentence means translating the sub-sentence into a specific other language and then translating it back into the current language; the resulting new sub-sentence is semantically the same as the original sub-sentence, but its wording is changed;

performing text conversion on a sub-sentence means applying a text surface conversion to the sub-sentence; the resulting new sub-sentence is semantically the same as the original sub-sentence, but its textual structure is changed.
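Of the three enhancement options, character replacement is the simplest to sketch. The example below assumes a small synonym table is available; `SYNONYMS` and `replace_keywords` are hypothetical names, and the table contents are invented for illustration only.

```python
# Hypothetical synonym table: keyword -> words with the same or similar meaning.
SYNONYMS = {"mobile phone": ["phone", "handset"]}


def replace_keywords(sub_sentence: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return new sub-sentences in which a keyword is replaced by each synonym."""
    variants = []
    for keyword, alternatives in synonyms.items():
        if keyword in sub_sentence:
            for alternative in alternatives:
                variants.append(sub_sentence.replace(keyword, alternative))
    return variants
```

Each returned variant keeps the meaning of the original sub-sentence while changing its wording, so the number of sub-sentences grows by the number of applicable synonyms.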

It should be noted that step 105 may be executed before step 103, as shown in fig. 2, or after step 103; all technical solutions so formed are within the scope of the present application.

In one implementation, after the plurality of new long sentences are obtained in step 104, the method in this embodiment may further include the following step, as shown in fig. 3:

Step 106: perform data enhancement on one or more sub-sentences in the new long sentences so that the number of new long sentences is increased.

Specifically, in this embodiment, data enhancement processing may be performed one or more times, in multiple ways, on one or more sub-sentences of a new long sentence, so as to increase the number of long sentences. For example, in the first round of data enhancement, only one sub-sentence in the long sentence is enhanced, which increases the number of long sentences; in the second round, two or more sub-sentences in the long sentence may be enhanced, which increases the number of long sentences again, and so on.
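The round-by-round enhancement described above can be sketched by choosing which positions of a long sentence to replace. The sketch assumes enhanced variants of each sub-sentence are already available (for example from character replacement or reverse translation); `enhance_long_sentence` and the `variants` mapping are hypothetical.

```python
from itertools import combinations, product


def enhance_long_sentence(subs: list[str], variants: dict[str, list[str]], k: int) -> list[list[str]]:
    """Return new long sentences in which exactly k sub-sentences are replaced by a variant."""
    results = []
    for positions in combinations(range(len(subs)), k):
        # Candidate replacements for each chosen position.
        options = [variants.get(subs[i], []) for i in positions]
        for choice in product(*options):
            new_subs = list(subs)
            for index, variant in zip(positions, choice):
                new_subs[index] = variant
            results.append(new_subs)
    return results
```

Calling it with `k=1` corresponds to the first round (one sub-sentence enhanced), and `k=2` to a later round in which two sub-sentences are enhanced at once.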

For example, the data enhancement processing for the sub-sentences in the long sentence in the present embodiment may include any one or more of the following: performing character replacement on the sub-sentences, performing reverse translation on the sub-sentences, and performing text conversion on the sub-sentences.

It should be noted that the long sentences subjected to data enhancement in step 106 may include not only the new long sentences obtained by combining sub-sentences, but also the original long sentences before they were split into sub-sentences.

In one implementation, before the long sentences are split into short sentences in step 102 to obtain a plurality of sub-sentences, the method in this embodiment may further include the following step, as shown in fig. 4:

Step 107: perform data enhancement on the long sentences in the target data so that the number of long sentences in the target data is increased.

Specifically, in this embodiment, data enhancement processing may be performed on the long sentences one or more times, in multiple ways, to increase their number. The long sentences are then split into sub-sentences, and the sub-sentences are combined according to the semantic logical relationships between them. Sentence splitting and recombination are thus performed on an already enlarged set of long sentences, which further increases the number of long sentences.

For example, the data enhancement processing for the long sentences in this embodiment may include any one or more of the following: performing character replacement on the long sentence, performing reverse translation on the long sentence, and performing text conversion on the long sentence.

Based on the above implementations, as shown in fig. 5, this embodiment may proceed as follows. First, data enhancement is performed on the long sentences in the target data. After the long sentences are split into sub-sentences in step 102, and before the sub-sentences are combined in step 104, data enhancement is performed on each sub-sentence so that the number of sub-sentences is increased. The enlarged set of sub-sentences is then combined according to the semantic logical relationships between the sub-sentences to obtain a plurality of new long sentences. Finally, data enhancement is performed again on one or more sub-sentences contained in the original long sentences and in the new long sentences obtained in step 104. The number of long sentences is thereby further increased, effective data enhancement is realized, and the reliability of data enhancement is improved.

In one implementation, before the sub-sentences are combined according to the semantic logical relationship in step 104 to obtain a plurality of new long sentences, the method in this embodiment may further include the following step, as shown in fig. 6:

Step 108: group the sub-sentences according to the semantic logical relationship to obtain a plurality of sentence groups.

In this embodiment, the sub-sentences may be grouped in a specific grouping manner according to the semantic logical relationships between them, so that sub-sentences meeting the corresponding condition are placed in the same sentence group.

Specifically, in this embodiment, the sentence groups may be divided according to whether the events semantically expressed by the sub-sentences having the semantic logical relationship are the same or similar, so that the semantics of the sub-sentences in the same sentence group satisfy the semantic similarity condition. The semantic similarity condition here can be understood as the similarity of the semantically expressed events between the sub-sentences being greater than a preset similarity threshold. For example, the sub-sentence "the mobile phone cannot be turned on" and the sub-sentence "the mobile phone will not turn on" both express the event of being unable to start, so their semantics satisfy the semantic similarity condition and they can be placed in the same sentence group.
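The threshold-based grouping can be sketched with a simple greedy pass. The application does not name a similarity measure, so the example substitutes word-level Jaccard overlap purely as a stand-in; `jaccard`, `group_sub_sentences`, and the default threshold of 0.5 are assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for a real semantic similarity model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def group_sub_sentences(subs, threshold=0.5, similarity=jaccard):
    """Place each sub-sentence in the first group whose representative is similar enough."""
    groups = []
    for sub in subs:
        for group in groups:
            # Compare against the first member of the group (its representative).
            if similarity(sub, group[0]) > threshold:
                group.append(sub)
                break
        else:
            groups.append([sub])
    return groups
```

Word overlap only approximates "same or similar event"; a sentence-embedding model with cosine similarity would be a more realistic choice for the `similarity` argument.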

Based on this, in this embodiment, combining the sub-sentences according to the semantic logical relationship in step 104 to obtain a plurality of new long sentences may be implemented as follows: according to the semantic logical relationship, one sub-sentence is arbitrarily selected from each of any plurality of sentence groups, and the selected sub-sentences are combined to obtain a plurality of new long sentences.

That is to say, in this embodiment, the sub-sentences are clustered according to their semantic logical relationships, and sub-sentences meeting the semantic similarity condition are placed in the same sentence group. The semantic logical relationships between the resulting sentence groups are the same as those between the sub-sentences they contain. On this basis, sub-sentences are arbitrarily selected from the sentence groups according to the semantic logical relationships between the sentence groups and are arranged and combined in a specific manner to obtain a plurality of new long sentences, thereby increasing the number of long sentences.

In a specific implementation, arbitrarily selecting one sub-sentence from each of any plurality of sentence groups for combination according to the semantic logical relationship in step 104 may be implemented by the following process, as shown in fig. 7:

Step 701: obtain one or more sentence sets according to the semantic logical relationship.

A sentence set comprises a plurality of sentence groups, and the sub-sentences contained in the sentence groups of the same sentence set have semantic logical relationships.

It should be noted that different sentence sets may include one or more of the same sentence groups, and may of course also include one or more different sentence groups. The semantic logical relationships between the sentence groups in a sentence set are such that sub-sentences from those sentence groups can be combined into one long sentence. As shown in fig. 8, the sentence group corresponding to the event "mobile phone purchase" and the sentence group corresponding to the event "mobile phone fall" form one sentence set, and there is a conditional relationship between them. As shown in fig. 9, the sentence groups corresponding to the events "mobile phone purchase", "mobile phone fall", and "mobile phone repair" form one sentence set; there is a conditional relationship between the sentence group of "mobile phone purchase" and that of "mobile phone fall", and a causal relationship between the sentence group of "mobile phone fall" and that of "mobile phone repair", and so on.

Specifically, in this embodiment, one or more statement sets may be obtained in the following manner:

First, a graph corresponding to the sentence groups is established according to the semantic logical relationship.

The graph comprises a plurality of nodes, at least one node corresponds to a sentence group, at least two nodes are connected with each other, and a connection line between nodes represents the semantic logical relationship between the sentence groups corresponding to the connected nodes. Because the sub-sentences they contain have semantic logical relationships, the sentence groups divided in this embodiment also have corresponding semantic logical relationships. Based on this, after the sentence groups are taken as nodes in the graph, connection lines between the nodes are established according to the semantic logical relationships between the sentence groups. A connection line between nodes can be understood as an association between the nodes or a node path between them. The connection line has a direction, which characterizes the direction of that association or node path. In this embodiment, a connection line between nodes indicates that the sentence groups corresponding to those nodes have a semantic logical relationship, and the direction of the connection line characterizes the type of that semantic logical relationship.
For example, the connection line between the node corresponding to the sentence group of "mobile phone purchase" and the node corresponding to the sentence group of "mobile phone fall" points from the former to the latter. This characterizes that there is a conditional relationship between the two sentence groups, and that the node corresponding to "mobile phone purchase" is the condition node of the node corresponding to "mobile phone fall", and so on.

As shown in the graph in fig. 10, each sentence group corresponds to the event expressed by the sub-sentences in that group, such as "mobile phone purchase", "mobile phone fall", "mobile phone water ingress", "unable to start", and "mobile phone repair". Because of the semantic logical relationships between the sub-sentences, the sentence groups have corresponding semantic logical relationships, and a graph composed of the nodes corresponding to the sentence groups is established accordingly. The sentence groups of "mobile phone purchase", "mobile phone fall", and "mobile phone repair" correspond to nodes A, B, and C respectively. There is a connection line between A and B and another between B and C. The connection line between A and B represents the conditional relationship between the sentence group of "mobile phone purchase" and that of "mobile phone fall", and the connection line between B and C represents the causal relationship between the sentence group of "mobile phone fall" and that of "mobile phone repair".
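The A, B, C portion of the graph described for fig. 10 can be written down as a small directed graph with labelled connection lines. The dictionary layout and the helper `successors` are assumptions made for this sketch; the application does not prescribe a storage format.

```python
# Nodes: one per sentence group, keyed by the node label used in the description.
graph_nodes = {
    "A": "mobile phone purchase",
    "B": "mobile phone fall",
    "C": "mobile phone repair",
}

# Directed connection lines: (source, target) -> type of semantic logical relationship.
graph_edges = {
    ("A", "B"): "conditional",
    ("B", "C"): "causal",
}


def successors(node: str, edges: dict) -> list[str]:
    """Return the nodes that `node` points to, in sorted order."""
    return sorted(target for (source, target) in edges if source == node)
```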

Then, one or more sentence sets are acquired according to the nodes and the connection lines in the graph.

A sentence set comprises a plurality of sentence groups, and connection lines exist in the graph between the nodes corresponding to the sentence groups of the same sentence set.

Specifically, in this embodiment, the sentence groups corresponding to nodes that are connected together are divided into a sentence set according to the connection lines between the nodes in the graph, so that the nodes corresponding to the sentence groups in the same sentence set are connected directly or through other nodes. Further, when the sentence sets are divided, the sentence groups corresponding to all the connected nodes are not necessarily placed in a single sentence set. Instead, the sentence groups corresponding to two or more of the connected nodes are selected and divided into a sentence set, and therefore the same sentence group may appear in different sentence sets.

Taking the graph in fig. 10 as an example, among the connected nodes A, B, and C, the sentence groups corresponding to A and B are selected and divided into sentence set x, and the sentence groups corresponding to A, B, and C are selected and divided into sentence set y, as shown in fig. 11. Sentence set x and sentence set y therefore share the sentence groups corresponding to A and B, and differ in the sentence group corresponding to C.
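One possible reading of "selecting two or more connected nodes" is to enumerate every directed path of at least two nodes in the graph and treat each path as a sentence set. That interpretation, and the function name `sentence_sets`, are assumptions of this sketch rather than something the application states.

```python
def sentence_sets(edges):
    """Enumerate directed simple paths with at least two nodes; each path is one sentence set."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)

    found = []

    def extend(path):
        if len(path) >= 2:
            found.append(tuple(path))
        for nxt in adjacency.get(path[-1], []):
            if nxt not in path:  # avoid revisiting a node
                extend(path + [nxt])

    for start in sorted({node for edge in edges for node in edge}):
        extend([start])
    return found
```

For the edges A to B and B to C, this yields the sets (A, B) and (A, B, C), matching sentence sets x and y in the description, plus (B, C), which the description does not mention but which also consists of two connected nodes.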

Step 702: for each sentence set, arbitrarily select one sub-sentence from each sentence group contained in the sentence set and combine them to obtain a plurality of new long sentences.

Each new long sentence comprises a plurality of sub-sentences, and semantic logical relationships exist among the sub-sentences contained in the new long sentence.

That is to say, in this embodiment, one sub-sentence is selected from each of the sentence groups contained in the sentence set, so that the sub-sentences in each new long sentence come from the respective sentence groups of the sentence set. When sub-sentences are selected from the sentence groups in the sentence set, they may be selected by a random algorithm or by permutation and combination, so that duplicate new long sentences are avoided and the number of combined new long sentences is maximized.

Taking sentence set y in fig. 11 as an example, there are 2 sub-sentences in the sentence group of "mobile phone purchase", 3 sub-sentences in the sentence group of "mobile phone fall", and 3 sub-sentences in the sentence group of "mobile phone repair". When sub-sentences in sentence set y are combined into a long sentence, one sub-sentence is arbitrarily selected from each of the three sentence groups. As shown in fig. 12, after multiple selections, 2 × 3 × 3 = 18 long sentences can be obtained, thereby greatly increasing the number of long sentences.
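The count of 18 follows from taking one sub-sentence from each group, which is a Cartesian product. The sketch below reproduces the 2, 3, 3 group sizes; the sub-sentences themselves are invented for illustration, and joining them with a comma is an assumed combination rule.

```python
from itertools import product

# Sentence set y: three sentence groups with 2, 3 and 3 sub-sentences (hypothetical text).
sentence_set_y = {
    "mobile phone purchase": ["I bought a mobile phone", "I purchased a new mobile phone"],
    "mobile phone fall": ["I dropped it", "it fell on the floor", "it slipped out of my hand"],
    "mobile phone repair": [
        "how to repair it?",
        "where can I get it repaired?",
        "does it need to be sent for repair?",
    ],
}


def combine(sentence_set: dict[str, list[str]]) -> list[str]:
    """Pick one sub-sentence from every group, in group order, and join them."""
    return [", ".join(choice) for choice in product(*sentence_set.values())]


new_long_sentences = combine(sentence_set_y)
```

Enumerating the full product guarantees that no combination is repeated and that the maximum number of new long sentences is reached; random selection, which the description also allows, would need deduplication to give the same guarantee.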

Referring to fig. 13, a schematic structural diagram of a data processing apparatus provided in the second embodiment of the present application is shown, where the apparatus may be configured in an electronic device capable of performing data processing, such as a computer or a server. The technical scheme in the embodiment is mainly used for improving the reliability of data enhancement.

Specifically, the apparatus in this embodiment may include the following units:

a data obtaining unit 1301, configured to obtain target data, where the target data includes multiple long sentences, and a long sentence is a sentence with complete semantics;

a sentence splitting unit 1302, configured to split each long sentence into short sentences to obtain multiple sub-sentences;

a logic obtaining unit 1303, configured to obtain semantic logical relationships between the sub-sentences;

and a sentence combination unit 1304, configured to combine the sub-sentences according to the semantic logical relationships to obtain a plurality of new long sentences.
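The four units can be pictured as four functions chained into a pipeline. The sketch below is illustrative only: the sample sentences are invented, splitting at commas and semicolons is an assumption, and the positional heuristic in `obtain_relations` is a stand-in for real semantic analysis, which the application does not specify at this point.

```python
import re
from itertools import product

def obtain_target_data():
    """Data obtaining unit 1301: return long sentences (made-up samples here)."""
    return [
        "I bought a phone last month, it fell on the floor, I sent it for repair",
        "I got a new phone recently, I dropped it yesterday, it is being repaired",
    ]

def split_long_sentences(long_sentences):
    """Sentence splitting unit 1302: split at commas and semicolons (assumed)."""
    return [
        [part.strip() for part in re.split(r"[,;]", sentence) if part.strip()]
        for sentence in long_sentences
    ]

def obtain_relations(clause_lists):
    """Logic obtaining unit 1303 (assumed heuristic, not from the application):
    clauses at the same position form one group, and neighbouring groups are
    treated as logically related."""
    groups = [list(dict.fromkeys(column)) for column in zip(*clause_lists)]
    relations = [(i, i + 1) for i in range(len(groups) - 1)]
    return groups, relations

def combine(groups, relations):
    """Sentence combination unit 1304: one sub-sentence per related group."""
    if not relations:
        return []
    order = [relations[0][0]] + [right for _, right in relations]
    return [", ".join(choice) for choice in product(*(groups[i] for i in order))]

long_sentences = obtain_target_data()
groups, relations = obtain_relations(split_long_sentences(long_sentences))
new_long_sentences = combine(groups, relations)
```

With two three-clause inputs, the pipeline yields 2 × 2 × 2 = 8 long sentences, of which 6 did not appear in the input.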

As can be seen from the foregoing solution, in the data processing apparatus provided in the second embodiment of the present application, after the long sentences are split, the sub-sentences are recombined according to the semantic logical relationships between them, so that a plurality of new long sentences different from the long sentences before splitting are obtained and the number of long sentences is increased. Schemes such as vocabulary replacement, reverse translation, and text surface conversion produce data that changes little compared with the original data; by contrast, this embodiment substantially increases the number of sentences, which greatly increases the data volume, realizes data enhancement, and improves the reliability of data enhancement.

In one implementation, the apparatus in this embodiment may further include the following unit, as shown in fig. 14:

a statement enhancing unit 1305, configured to perform data enhancement on the sub-statements before the sentence combination unit 1304 combines the sub-statements according to the semantic logical relationship to obtain a plurality of new long statements, so that the number of the sub-statements is increased.

In one implementation, the statement enhancing unit 1305 is further configured to perform data enhancement on one or more sub-statements in the new long statements after the sentence combination unit 1304 obtains a plurality of new long statements, so that the number of the new long statements is increased.

Further, the statement enhancing unit 1305 is specifically configured to perform any one or more of the following: character replacement on the sub-sentence, reverse translation on the sub-sentence, and text conversion on the sub-sentence.
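Of the three enhancement operations, replacement is the easiest to sketch without external services. The snippet below swaps one word at a time using a small lookup table; the table `SYNONYMS`, the function name `replace_words`, and the sample sub-sentence are hypothetical. Reverse translation and text conversion would normally rely on a translation model or rule set and are not shown.

```python
# Hypothetical replacement table; a real system would use a larger resource.
SYNONYMS = {
    "bought": ["purchased"],
    "phone": ["mobile", "handset"],
}

def replace_words(sub_sentence, synonyms=SYNONYMS):
    """Return variants of sub_sentence with exactly one word replaced."""
    words = sub_sentence.split()
    variants = []
    for i, word in enumerate(words):
        for alternative in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [alternative] + words[i + 1:]))
    return variants

group = ["I bought a phone"]
# Keep the original sub-sentence and append every single-word variant.
enhanced_group = group + [v for s in group for v in replace_words(s)]
```

Here the group grows from one sub-sentence to four, which in turn multiplies the number of long sentences that can be combined from it.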

In one implementation, the sentence combination unit 1304 is further configured to, before combining the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences: group the sub-sentences according to the semantic logical relationship to obtain a plurality of sentence groups, wherein the semantics of the sub-sentences in the same sentence group meet a semantic similarity condition.

Based on this, the sentence combination unit 1304 is specifically configured to: according to the semantic logical relationship, randomly select one sub-sentence from each of any plurality of sentence groups and combine the selected sub-sentences to obtain a plurality of new long sentences.

In one implementation, the sentence combination unit 1304 is specifically configured to: obtain one or more statement sets according to the semantic logical relationship, wherein each statement set comprises a plurality of statement groups, and the sub-statements contained in the statement groups of the same statement set have a semantic logical relationship. For example, a graph corresponding to the statement groups is established according to the semantic logical relationship; the graph includes a plurality of nodes, at least one node corresponds to a statement group, at least two nodes have a connecting line between them, and a connecting line between nodes represents the semantic logical relationship between the statement groups corresponding to the connected nodes. One or more statement sets are then obtained according to the nodes and the connecting lines in the graph, wherein each statement set comprises a plurality of statement groups, and connecting lines exist in the graph between the nodes corresponding to the statement groups of the same statement set. Then, for each statement set, one sub-statement is randomly selected from each statement group contained in the statement set, and the selected sub-statements are combined to obtain a plurality of new long statements, wherein each new long statement comprises a plurality of sub-statements that have a semantic logical relationship with one another.

In one implementation, the statement enhancing unit 1305 is further configured to perform data enhancement on the long statements in the target data before the sentence splitting unit 1302 splits the long statements into short statements to obtain multiple sub-statements, so that the number of the long statements in the target data is increased.

It should be noted that, for the specific implementation of each unit in the present embodiment, reference may be made to the corresponding content in the foregoing, and details are not described here.

Referring to fig. 15, a schematic structural diagram of an electronic device according to a third embodiment of the present application is provided, where the electronic device may be an electronic device capable of performing data processing, such as a computer or a server. The technical scheme in the embodiment is mainly used for improving the reliability of data enhancement.

Specifically, the electronic device in this embodiment may include the following structure:

a memory 1501 for storing an application program and data generated by running the application program;

a processor 1502 for executing the application program to implement: obtaining target data, wherein the target data comprises a plurality of long sentences; splitting each long sentence into short sentences to obtain a plurality of sub-sentences; obtaining semantic logical relations between the sub-sentences; and combining the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences.

The electronic device in this embodiment may further include other components, such as a display, an input/output device, and the like.

As can be seen from the foregoing solution, in the electronic device provided by the third embodiment of the present application, after the long sentences are split, the sub-sentences are recombined according to the semantic logical relationships between them, so that a plurality of new long sentences different from the long sentences before splitting are obtained and the number of long sentences is increased. Schemes such as vocabulary replacement, reverse translation, and text surface conversion produce data that changes little compared with the original data; by contrast, this embodiment substantially increases the number of sentences, which greatly increases the data volume, realizes data enhancement, and improves the reliability of data enhancement.

Taking training of a deep learning model as an example, the technical scheme of the application is explained in detail as follows:

First, in the course of studying deep learning models, the inventors of the present application found that insufficient data currently causes many problems in the field of natural language processing. For example, a shortage of training corpora leaves a deep learning model with no room to show its capabilities, and a shortage of test data makes it impossible to measure the real performance of the model. Data enhancement technology therefore provides timely help. The solutions currently available are the following:

1. vocabulary replacement, reverse translation, and text surface conversion; however, these schemes do not increase the diversity of the data, and the result changes little compared with the original data;

2. random noise injection; however, the added noise can negatively affect the original semantics;

3. instance crossover enhancement; however, the generated statements are logically disjointed and do not conform to grammatical or semantic requirements, as shown in FIG. 16.

In view of this, by analyzing the existing solutions, the inventors of the present application found that they do not really increase the amount of data and sometimes do more harm than good, and therefore propose a data enhancement scheme based on the logical relationships between events.

In this technical scheme, events are combined according to the logical relationships among them with the help of an event graph, and data enhancement is then performed on the description data of each event. In this way, data diversity is enhanced while the logic of the generated data remains correct. The specific technical scheme is as follows:

1. First, extract and construct an event graph, specifically through the following steps:

(1) crawl or generalize part of the data;

(2) parse each sentence into clauses;

(3) determine the logical relationships between the clauses by analyzing the discourse structure relationships;

(4) cluster the clauses to obtain the event graph.

As shown in fig. 17, in this embodiment, a plurality of long sentences are crawled first; clauses and their logical relationships are then obtained by parsing and relation extraction; and a graph is then obtained by clustering and summarizing or extraction, as shown in fig. 18.
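Steps (2) to (4) can be sketched in a few lines of Python. The sketch makes several simplifying assumptions that are not in the application: clauses are split at punctuation, adjacency within a long sentence stands in for discourse-structure analysis, and a greedy word-overlap (Jaccard) clustering with a hypothetical threshold of 0.5 stands in for the clustering step. All function names and sample sentences are illustrative.

```python
import re

def split_clauses(sentence):
    """Step (2): split a long sentence into clauses at punctuation (assumed)."""
    return [c.strip() for c in re.split(r"[,;.]", sentence) if c.strip()]

def jaccard(a, b):
    """Word-overlap similarity between two clauses."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

def cluster(clauses, threshold=0.5):
    """Step (4): greedily put each clause into the first similar cluster."""
    clusters = []
    for clause in dict.fromkeys(clauses):  # skip exact duplicates
        for members in clusters:
            if jaccard(clause, members[0]) >= threshold:
                members.append(clause)
                break
        else:
            clusters.append([clause])
    return clusters

def build_graph(long_sentences, threshold=0.5):
    """Return (clusters, edges); an edge links clusters of adjacent clauses."""
    per_sentence = [split_clauses(s) for s in long_sentences]
    clusters = cluster([c for cs in per_sentence for c in cs], threshold)
    index = {c: i for i, members in enumerate(clusters) for c in members}
    edges = set()
    for clauses in per_sentence:
        # Step (3), simplified: adjacent clauses are assumed to be related.
        for left, right in zip(clauses, clauses[1:]):
            if index[left] != index[right]:
                edges.add((index[left], index[right]))
    return clusters, edges

sentences = [
    "I bought a phone, the phone fell down, I sent the phone for repair",
    "I bought a new phone, the phone fell down hard",
]
clusters, edges = build_graph(sentences)
```

For the two sample sentences, the five clauses fall into three clusters (purchase, fall, repair) joined by two edges, which gives a small chain-shaped graph.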

2. Second, combine the events according to the logical relationships among them. As shown in fig. 19, a plurality of statement sets are formed according to the logical relationships, and the statement groups in each statement set have corresponding logical relationships;

3. Finally, perform data enhancement (vocabulary replacement, reverse translation, etc.) on each event instance. As shown in fig. 20, sentence enhancement is performed on each sentence group in the sentence set so that the number of sentences in each sentence group increases; on this basis, more long sentences can be obtained after sentence combination.
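Enhancing each sentence group before combination has a multiplicative effect, because the number of combined long sentences is the product of the group sizes. The short calculation below illustrates this; the group sizes after enhancement are hypothetical (each group is simply assumed to double).

```python
from math import prod

def combination_count(group_sizes):
    """Number of long sentences when one sub-sentence is taken from each group."""
    return prod(group_sizes)

before = combination_count([2, 3, 3])  # group sizes before enhancement
after = combination_count([4, 6, 6])   # hypothetical: every group doubled
```

With three groups, doubling each group multiplies the number of long sentences by 2 × 2 × 2 = 8, from 18 to 144.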

The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of data processing, comprising:

obtaining semantic logical relations between the sub-sentences;

2. The method of claim 1, before combining the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences, the method further comprising:

3. The method of claim 1 or 2, after obtaining a plurality of new long sentences, the method further comprising:

4. The method of claim 2 or 3, performing data enhancement on the sub-sentences, including any one or more of:

performing character replacement on the sub-sentence;

performing reverse translation on the sub-sentence;

and performing text conversion on the sub-sentence.

5. The method of claim 1, before combining the sub-sentences according to the semantic logical relationship to obtain a plurality of new long sentences, the method further comprising:

6. The method of claim 5, wherein the semantics of the sub-sentences in the same sentence group satisfy a semantic similarity condition.

7. The method of claim 5, wherein selecting one of the sub-sentences from any of the sentence groups to combine to obtain a plurality of new long sentences according to the semantic logical relationship, comprises:

8. The method of claim 7, obtaining one or more sets of statements according to the semantic logical relationship, comprising:

9. The method of claim 1, prior to performing short sentence splitting on the long sentence to obtain a plurality of sub-sentences, the method further comprising:

10. A data processing apparatus comprising: