Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a relation extracting method for text according to an embodiment of the present invention, including the following steps:
S11: encoding a text using BERT, and replicating the encoding result along the sequence dimension to raise its dimensionality, obtaining a matrix arranged by segments;
S12: performing multi-label classification on the character string corresponding to each coordinate in the matrix to obtain a label set for each coordinate;
S13: traversing all head labels, tail labels and entity labels in the label set of each coordinate, pairing them by handshake tagging to extract relations, and determining at least one triplet of a defined relation or an open relation, wherein the triplet represents a relation between entities in the text.
In the embodiment, the method can be applied to the fields of text mining, information retrieval, intelligent question-answering and the like, and relation triples in the text are extracted to represent the relation of the text. The text may be a dialogue between users, a question entered by users, or an article.
For step S11, given a text X = [x1, x2, …, xn], BERT encoding yields the output H = [h1, h2, …, hn]. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model built on the encoder of the Transformer architecture.
The encoded output is replicated along the sequence-length dimension to raise its dimensionality, giving a matrix A and its transpose A^T, from which a logits output matrix S of dimension n×n is obtained, as shown in Fig. 2. Here, the logits are the outputs of the final fully connected layer.
Each element Sij of the output matrix is computed by a formula whose two operators denote vector concatenation and vector dot product, respectively, and in which W1 and W2 are weight matrices and b1 and b2 are bias matrices. The final matrix S has dimension n×n×T, where T is the number of labels in the label set Lall.
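The replication step and the pairwise scoring can be illustrated with a minimal sketch. It assumes one possible scoring form that uses the symbols named above: the pair feature (concatenation of hi, hj and their element-wise product), the tanh non-linearity, the helper names `affine` and `pairwise_logits`, and all sizes are assumptions made for illustration, not the formula of the embodiment.

```python
import math
import random

def affine(W, b, x):
    """Return W x + b for a row-major weight matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def pairwise_logits(H, W1, b1, W2, b2):
    """Score every (i, j) pair of token encodings; returns an n x n x T nested list."""
    S = []
    for h_i in H:          # row i: h_i is repeated across the whole row
        row = []
        for h_j in H:      # column j: h_j is repeated down the whole column
            # Assumed pair feature: h_i, h_j and their element-wise product, concatenated.
            feat = h_i + h_j + [a * c for a, c in zip(h_i, h_j)]
            hidden = [math.tanh(v) for v in affine(W1, b1, feat)]
            row.append(affine(W2, b2, hidden))   # T logits for position (i, j)
        S.append(row)
    return S

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d, k, T = 5, 4, 6, 11          # hypothetical sizes: 5 characters, 11 labels
H = rand_matrix(n, d)             # stand-in for the BERT outputs h1..hn
W1, b1 = rand_matrix(k, 3 * d), [0.0] * k
W2, b2 = rand_matrix(T, k), [0.0] * T
S = pairwise_logits(H, W1, b1, W2, b2)
print(len(S), len(S[0]), len(S[0][0]))
```

Each position (i, j) of the resulting structure holds T logits, one per label, which is what the multi-label classification in step S12 operates on.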
For step S12, the label at each position of the matrix S determined in step S11 is not unique: an element Sij may carry several labels, so multi-label classification must be applied to Sij. Sij has dimension 1×T, i.e., a binary (0/1) classification is made for every label. For the t-th label, a value of 0 indicates that Sij does not have label t, and a value of 1 indicates that it does. The objective function is defined in terms of the following quantities: |D| is the number of training samples, n is the sentence length, T is the total number of labels, Pijt is the probability that the t-th label at Sij takes the value 1, and yijt is the true value (0 or 1) of the t-th label at Sij. In the prediction stage, a threshold is set, for example 0.5 (the threshold may be adjusted as needed and is not limited herein); when Pijt exceeds the threshold, the label Lt is considered present at Sij.
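Assuming the objective is the standard binary cross-entropy, which matches the quantities defined above, it could be written as follows. The normalisation constant and the sample index d are assumptions rather than part of the original description.

```latex
\mathcal{L} = -\frac{1}{|D|\, n^{2}\, T}
\sum_{d=1}^{|D|} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{t=1}^{T}
\Big[ y_{ijt}^{(d)} \log P_{ijt}^{(d)}
    + \big(1 - y_{ijt}^{(d)}\big) \log\big(1 - P_{ijt}^{(d)}\big) \Big]
```

Under this form every label at every matrix position is an independent 0/1 decision, which is consistent with applying a per-label threshold at prediction time.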
The label set comprises a subject entity label, an object entity label, a predicate entity label, head-character labels of two entities, tail-character labels of two entities, a subject-predicate head-character label, a subject-predicate tail-character label, an object-predicate head-character label and an object-predicate tail-character label.
Specifically, the label set Lall is:
label: time; meaning: a time entity; task: named entity recognition;
label: person; meaning: a person entity; task: named entity recognition;
label: subject; meaning: a subject entity; task: open relation extraction and defined relation extraction;
label: object; meaning: an object entity; task: open relation extraction and defined relation extraction;
label: ethnicity_head2head; meaning: head characters of two entities having an ethnicity relation; task: defined relation extraction;
label: ethnicity_tail2tail; meaning: tail characters of two entities having an ethnicity relation; task: defined relation extraction;
label: predicate; meaning: a predicate entity; task: open relation extraction;
label: subject_head2predicate_head; meaning: head characters of a subject entity and a predicate entity having an open relation; task: open relation extraction;
label: subject_tail2predicate_tail; meaning: tail characters of a subject entity and a predicate entity having an open relation; task: open relation extraction;
label: object_head2predicate_head; meaning: head characters of an object entity and a predicate entity having an open relation; task: open relation extraction;
label: object_tail2predicate_tail; meaning: tail characters of an object entity and a predicate entity having an open relation; task: open relation extraction.
In practice, the labels are not limited to those listed above and may be extended as needed. For example, for sentences describing family structure, the following may be added: label Parent_head2head, meaning head characters of two entities having a parent-child relation, task defined relation extraction; and label Parent_tail2tail, meaning tail characters of two entities having a parent-child relation, task defined relation extraction. Further labels can be added in the same way for other needs and are not described again here.
Taking the text "Yao Mou loves basketball" as an example, the encoding result is replicated along the sequence dimension to obtain a matrix arranged by segments. The abscissa of the matrix, from left to right, is "Yao", "Mou", "love", "basket", "ball", and the ordinate, from top to bottom, is likewise "Yao", "Mou", "love", "basket", "ball".
Multi-label classification is then performed on the character string corresponding to each coordinate of the segment-arranged matrix. Take coordinate (0, 1) as an example: it denotes the character string whose start position is 0 and whose end position is 1 (abscissa "Yao", ordinate "Mou"), i.e., "Yao Mou". The probability Pijt for (Yao, Mou, person) is 0.9, which exceeds the preset threshold 0.5, so the label "person" is assigned, i.e., "Yao Mou" is a person entity. The probability Pijt for (Yao, Mou, time) is 0, which is below 0.5, so the label "time" is not assigned. Each of the labels listed above is judged in the same way. If several labels qualify, they together form the label set of coordinate (0, 1). Other coordinates are handled likewise and are not described again here.
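The thresholding of a single coordinate can be sketched as follows. The helper name `label_set`, the shortened label list and the probabilities other than 0.9 for "person" and 0 for "time" are hypothetical values chosen for illustration.

```python
def label_set(probs, labels, threshold=0.5):
    """Return every label whose predicted probability exceeds the threshold."""
    return {name for name, p in zip(labels, probs) if p > threshold}

# Shortened, hypothetical label list; the full set Lall would be used in practice.
LABELS = ["time", "person", "subject", "object", "predicate"]

# Probabilities at coordinate (0, 1), i.e. the span "Yao Mou".
# 0.0 for "time" and 0.9 for "person" follow the example; the rest are made up.
p_01 = [0.0, 0.9, 0.7, 0.1, 0.05]

print(label_set(p_01, LABELS))
```

Because several probabilities can exceed the threshold at once, the result is a set of labels rather than a single class, which is what distinguishes this step from ordinary single-label classification.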
For step S13, take a more complex text as an example: "Yao Mou and wife She Mou are present together". After the processing of steps S11 and S12, the label set of each coordinate is obtained, as shown in Fig. 3, where S denotes subject (a subject entity); P denotes predicate (a predicate entity); O denotes object (an object entity); SH denotes subject_head2predicate_head (head characters of a subject entity and a predicate entity having an open relation); ST denotes subject_tail2predicate_tail (tail characters of a subject entity and a predicate entity having an open relation); OH denotes object_head2predicate_head (head characters of an object entity and a predicate entity having an open relation); and OT denotes object_tail2predicate_tail (tail characters of an object entity and a predicate entity having an open relation).
Each element has the form (i, j, L), where i and j give the position in the matrix and L (L ∈ Lall) indicates that the position carries label L. All head and tail labels (labels containing "head" or "tail") and all entity labels are then traversed to obtain explicit triplets.
Taking the more complex open relation extraction as an example: when the input text X contains the three elements of open relation extraction, (is, js, subject), (ip, jp, predicate) and (io, jo, object) (in short, the sentence contains a subject, a predicate and an object), and matrix position (is, ip) has the label "subject_head2predicate_head", position (js, jp) has the label "subject_tail2predicate_tail", position (io, ip) has the label "object_head2predicate_head" and position (jo, jp) has the label "object_tail2predicate_tail", the triplet <X[is:js], X[ip:jp], X[io:jo]> is obtained, for example <Yao Mou, wife, She Mou>. This tagging scheme may be figuratively called handshake tagging.
As an embodiment, a triplet of a defined relation is a triplet whose relation belongs to a preset relation type, and a triplet of an open relation is a triplet whose relation appears in the text and is not restricted to preset types.
In this embodiment, defined relation extraction extracts relation triplets of preset relation types. Text of this kind has a relatively simple structure, and the label set is dominated by "{REL}_head2head" and "{REL}_tail2tail", where REL is the value of a predefined relation type. When position (i, j) of the segment-arranged matrix has "{REL}_head2head", an entity starting at position i (a named entity or subject entity) and another entity starting at position j (a named entity or object entity) stand in the relation REL; "{REL}_tail2tail" indicates the end positions in the same way. Because the named entity recognition task cannot cover all entity types, the subject and object of some relation triplets have no specific named entity label, so the "subject" and "object" labels are added.
Taking the matrix for the text "Yao Mou, Han nationality" as an example: position (0, 1) (abscissa "Yao", ordinate "Mou") has the label "person"; position (3, 4) (abscissa and ordinate being the first and last characters of "Han nationality") has the label "object"; position (0, 3) (abscissa "Yao", ordinate the first character of "Han nationality") has the label "ethnicity_head2head"; and position (1, 4) (abscissa "Mou", ordinate the last character of "Han nationality") has the label "ethnicity_tail2tail". It can then be uniquely determined that the person entity "Yao Mou" and the object entity "Han nationality" have the relation "ethnicity", giving the triplet <Yao Mou, ethnicity, Han nationality>.
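The pairing for a defined relation can be sketched as below, using the coordinates of the example. The function name `decode_defined`, the choice of which entity labels may act as subject or object, and the tag name `ethnicity_head2head` (standing in for the ethnic-relation handshake label) are assumptions made for illustration.

```python
def decode_defined(tags, relation):
    """Pair entity spans whose heads and tails both carry the relation's handshake labels.

    tags: set of (i, j, label) elements that passed the probability threshold.
    Returns (subject_span, relation, object_span) triples expressed as index spans.
    """
    subject_labels = {"person", "time", "subject"}   # assumed: named entities or "subject"
    object_labels = {"person", "time", "object"}     # assumed: named entities or "object"
    subjects = sorted({(i, j) for i, j, lab in tags if lab in subject_labels})
    objects = sorted({(i, j) for i, j, lab in tags if lab in object_labels})
    triples = []
    for s in subjects:
        for o in objects:
            if s == o:
                continue
            heads_shake = (s[0], o[0], relation + "_head2head") in tags
            tails_shake = (s[1], o[1], relation + "_tail2tail") in tags
            if heads_shake and tails_shake:
                triples.append((s, relation, o))
    return triples

tags = {
    (0, 1, "person"),
    (3, 4, "object"),
    (0, 3, "ethnicity_head2head"),
    (1, 4, "ethnicity_tail2tail"),
}
print(decode_defined(tags, "ethnicity"))  # [((0, 1), 'ethnicity', (3, 4))]
```

Requiring both the head and the tail handshake pins down the boundaries of both entities, so a span that merely shares a starting character with the true entity is not paired by mistake.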
Open relation extraction extracts relation triplets that appear in the text and are not restricted to preset types. Text of this kind has a relatively complex structure. The label set is dominated by "subject_head2predicate_head", "subject_tail2predicate_tail", "object_head2predicate_head" and "object_tail2predicate_tail", and the task additionally identifies the predicate entity "predicate" (in short, a sentence may have a subject-predicate-object structure, which is closer to how users speak in daily life). When position (i, j) of the segment-arranged matrix has "subject_head2predicate_head", the subject entity starting at position i and the predicate entity starting at position j belong to the same open relation; the other labels are interpreted in the same way.
Take as an example a text stating that the game "double line" is issued by EA. In the resulting matrix, position (1, 4) (abscissa "double", ordinate "line") has the label "subject"; position (10, 11) (abscissa "E", ordinate "A") has the label "object"; position (12, 13) (abscissa and ordinate being the first and last characters of "issue") has the label "predicate"; position (1, 12) (abscissa "double", ordinate the first character of "issue") has the label "subject_head2predicate_head"; position (4, 13) (abscissa "line", ordinate the last character of "issue") has the label "subject_tail2predicate_tail"; position (10, 12) (abscissa "E", ordinate the first character of "issue") has the label "object_head2predicate_head"; and position (11, 13) (abscissa "A", ordinate the last character of "issue") has the label "object_tail2predicate_tail".
It can then be uniquely determined that the subject entity "double line" and the object entity "EA" have the open relation "issue", giving the triplet <double line, issue, EA>.
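The handshake pairing for an open relation can be sketched as follows, using the coordinates of the example and taking "EA" as the object entity and "issue" as the predicate entity, as the resulting triplet indicates. The function name `decode_open` and the representation of triplets as index spans are assumptions made for illustration.

```python
SH = "subject_head2predicate_head"
ST = "subject_tail2predicate_tail"
OH = "object_head2predicate_head"
OT = "object_tail2predicate_tail"

def decode_open(tags):
    """Return (subject_span, predicate_span, object_span) triples.

    tags: set of (i, j, label) elements that passed the probability threshold.
    A triple is kept only if all four handshake labels link its three spans.
    """
    subjects = sorted((i, j) for i, j, lab in tags if lab == "subject")
    predicates = sorted((i, j) for i, j, lab in tags if lab == "predicate")
    objects = sorted((i, j) for i, j, lab in tags if lab == "object")
    triples = []
    for i_s, j_s in subjects:
        for i_p, j_p in predicates:
            # Subject and predicate must shake hands at both head and tail.
            if (i_s, i_p, SH) not in tags or (j_s, j_p, ST) not in tags:
                continue
            for i_o, j_o in objects:
                # Object and the same predicate must shake hands as well.
                if (i_o, i_p, OH) in tags and (j_o, j_p, OT) in tags:
                    triples.append(((i_s, j_s), (i_p, j_p), (i_o, j_o)))
    return triples

tags = {
    (1, 4, "subject"), (10, 11, "object"), (12, 13, "predicate"),
    (1, 12, SH), (4, 13, ST), (10, 12, OH), (11, 13, OT),
}
print(decode_open(tags))  # [((1, 4), (12, 13), (10, 11))]
```

The returned spans can be mapped back to text as X[is:js], X[ip:jp] and X[io:jo]. All seven elements are needed here, so removing any one handshake label leaves the triplet undetermined.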
As can be seen from this embodiment, with the designed label set and handshake pairing, both defined relation extraction and open relation extraction can be realized. Loss of encoded information is greatly reduced, and relation triplets of multiple dimensions can be represented accurately.
As an embodiment, after said determining at least one triplet of defined relationships or open relationships, the method further comprises:
When an entity relation conflict exists among the triplets, determining, for each triplet, the average of all label probabilities in the triplet as its score, and deleting the conflicting triplet with the lowest score, so as to resolve the entity relation conflict.
In this embodiment, determining a triplet requires identifying not only the subject, predicate and object (only the subject and object in defined relation extraction) but also the labels marking the boundaries of the entities. The final score of a triplet is therefore defined as the average of the probabilities of all label elements that determine it:
$$\mathrm{score} = \frac{1}{N} \sum_{(i,j,L)} p(i,j,L)$$
where N is the number of elements that determine the triplet (typically 4 for defined relation extraction and 7 for open relation extraction) and p(i, j, L) is the probability score of an element. The score can be used to balance the precision and recall of the produced triplets and to resolve conflicts between triplets. Take the text "President a of country A meets President b of country B" as an example. Suppose the raw result contains the triplets <country A, president, a>, <country B, president, b> and <country A, president, b>. These triplets clearly conflict; their scores are computed with the above formula and ranked, and the conflicting triplet with the lower score, <country A, president, b>, is removed, which resolves the entity relation conflict. The overall framework of the method is shown in Fig. 4.
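The scoring and conflict removal can be sketched as below. The conflict rule used here (two triplets with the same relation that share exactly one of subject and object) is an assumption, since the embodiment does not spell out how a conflict is detected. The function names and the probabilities are likewise hypothetical.

```python
def score(element_probs):
    """Average the probabilities of all label elements that determine a triplet."""
    return sum(element_probs) / len(element_probs)

def conflicts(t1, t2):
    """Assumed rule: same relation, and exactly one of subject/object is shared."""
    (s1, r1, o1), (s2, r2, o2) = t1, t2
    return r1 == r2 and ((s1 == s2) != (o1 == o2))

def resolve(candidates):
    """candidates: dict mapping a triplet to the list of its element probabilities.

    Repeatedly deletes the lowest-scoring triplet that is still in conflict.
    Returns a dict mapping each surviving triplet to its score.
    """
    kept = {t: score(p) for t, p in candidates.items()}
    while True:
        clashing = {t for t in kept for u in kept if t != u and conflicts(t, u)}
        if not clashing:
            return kept
        del kept[min(clashing, key=kept.get)]

candidates = {
    ("country A", "president", "a"): [0.90, 0.95, 0.90, 0.97],
    ("country B", "president", "b"): [0.92, 0.90, 0.95, 0.91],
    ("country A", "president", "b"): [0.90, 0.90, 0.30, 0.40],
}
for triplet in resolve(candidates):
    print(triplet)
```

With these values the third candidate has the lowest average among the conflicting triplets and is removed, after which the remaining two no longer conflict.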
It can be seen from this embodiment that after the multidimensional relation triples are determined, whether the triples conflict with each other is further judged, so that the accuracy of extracting the relation is further ensured.
Fig. 5 is a schematic structural diagram of a relation extracting system for text according to an embodiment of the present invention, where the relation extracting system may perform the relation extracting method for text according to any of the above embodiments and be configured in a terminal.
The relation extraction system 10 for text provided in this embodiment includes a matrix determination program module 11, a label classification program module 12, and a label pairing program module 13.
The matrix determination program module 11 is configured to encode a text using BERT and replicate the encoding result along the sequence dimension to raise its dimensionality, obtaining a matrix arranged by segments. The label classification program module 12 is configured to perform multi-label classification on the character string corresponding to each coordinate in the matrix to obtain a label set for each coordinate. The label pairing program module 13 is configured to traverse all head labels, tail labels and entity labels in the label set of each coordinate, pair them by handshake tagging to extract relations, and determine at least one triplet of a defined relation or an open relation, the triplet representing a relation between entities in the text.
Further, the system also includes a conflict resolution program module for:
When an entity relation conflict exists among the triplets, determining, for each triplet, the average of all label probabilities in the triplet as its score, and deleting the conflicting triplet with the lowest score, so as to resolve the entity relation conflict.
The embodiment of the invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the relation extraction method for text in any of the above method embodiments.
As one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
encoding a text using BERT, and replicating the encoding result along the sequence dimension to raise its dimensionality, obtaining a matrix arranged by segments;
performing multi-label classification on the character string corresponding to each coordinate in the matrix to obtain a label set for each coordinate;
and traversing all head labels, tail labels and entity labels in the label set of each coordinate, pairing them by handshake tagging to extract relations, and determining at least one triplet of a defined relation or an open relation, the triplet representing a relation between entities in the text.
As a non-volatile computer readable storage medium, it may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the relation extraction method for text in any of the method embodiments described above.
Fig. 6 is a schematic diagram of the hardware structure of an electronic device for the relation extraction method for text according to another embodiment of the present application. As shown in Fig. 6, the device includes one or more processors 610 and a memory 620; one processor 610 is taken as an example in Fig. 6. The device for the relation extraction method for text may further include an input device 630 and an output device 640.
The processor 610, the memory 620, the input device 630 and the output device 640 may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 6.
The memory 620 is used as a non-volatile computer readable storage medium for storing non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the relation extraction method for text in the embodiment of the present application. The processor 610 executes various functional applications of the server and data processing, i.e., implements the relationship extraction method for text of the above-described method embodiment, by running non-volatile software programs, instructions, and modules stored in the memory 620.
The memory 620 may include a storage program area that may store an operating system, application programs required for at least one function, a storage data area that may store data, and the like. In addition, memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 620 optionally includes memory remotely located relative to processor 610, which may be connected to the mobile device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 630 may receive input numeric or character information. The output device 640 may include a display device such as a display screen.
The one or more modules are stored in the memory 620 that, when executed by the one or more processors 610, perform the relationship extraction method for text in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
The non-transitory computer readable storage medium may include a storage program area that may store an operating system, an application program required for at least one function, and a storage data area that may store data created according to use of the device, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium may optionally include memory remotely located relative to the processor, which may be connected to the apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the invention also provides electronic equipment, which comprises at least one processor and a memory in communication connection with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the steps of the relation extraction method for text of any embodiment of the invention.
The electronic device of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed primarily at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions and generally also support mobile Internet access. Such terminals include PDA, MID and UMPC devices, such as tablet computers.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys and portable vehicle navigation devices.
(4) Other electronic devices with data processing functions.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," "includes," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.