Disclosure of Invention
To address the above technical problems, the present invention provides a method and a system for associating knowledge base documents with knowledge graph entities, which can improve the accuracy and recall of entity association.
The technical solution adopted by the present invention to solve the above technical problems is as follows:
In a first aspect, the present invention provides a method for associating knowledge base documents with knowledge graph entities, comprising:
performing entity identification on the text to obtain an entity list;
searching in a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
calculating the similarity between first feature information of the text and second feature information of each candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and associating the entity with the candidate entity whose total similarity is the maximum and exceeds a threshold value.
The invention has the beneficial effects that:
the similarity is calculated by making full use of the feature information of the text together with the feature information of the candidate entities and of the associated nodes retrieved for the entities in the text, which effectively improves the accuracy and recall of entity association.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Further, the position of each entity in the entity list within the knowledge base document is queried to obtain a position list corresponding to the entity.
Further, emphasis formatting is applied to the entity at the positions in the position list within the knowledge base document.
In a second aspect, the present invention further provides a system for associating knowledge base documents with knowledge graph entities, comprising:
the entity identification module is used for carrying out entity identification on the text to obtain an entity list;
the candidate entity searching module is used for searching in a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
the similarity calculation module is used for calculating the similarity between first feature information of the text and second feature information of each candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and the entity association module is used for associating the entity with the candidate entity whose total similarity is the maximum and exceeds a threshold value.
Further, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Further, the system further includes:
and the position query module is used for querying the position of each entity in the entity list within the knowledge base document to obtain a position list corresponding to the entity.
Further, the system further includes:
and the format processing module is used for applying emphasis formatting to the entity at the positions in the position list within the knowledge base document.
In a third aspect, the present invention further provides an electronic device, comprising a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate with each other through the bus, and the processor executes the machine-readable instructions to perform the steps of the above method.
In a fourth aspect, the present invention further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the above method.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of a method for associating knowledge base documents with knowledge graph entities according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
S1, performing entity recognition on the text to obtain an entity list;
Specifically, the text is a passage from a knowledge base document. A CRF (conditional random field) entity recognition model is used to perform entity recognition on the knowledge base document, recognizing entities such as person names and object names to obtain an entity list for the text.
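As an illustrative sketch only (the patent does not prescribe a particular implementation), the following Python snippet shows how a BIO label sequence produced by a CRF tagger might be decoded into the entity list; the tokens, labels and tag names are hypothetical.

```python
# Minimal sketch: decode a BIO label sequence (e.g. produced by a CRF tagger) into an entity list.
# The tokens and labels below are hypothetical; in practice they come from the trained CRF model.
def decode_bio(tokens, labels):
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):                # beginning of a new entity
            if current:
                entities.append("".join(current))
            current = [token]
        elif label.startswith("I-") and current:  # continuation of the current entity
            current.append(token)
        else:                                     # "O" (or a stray "I-") closes the current entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

# Hypothetical tagger output over the characters of one sentence
tokens = ["韩", "红", "将", "赴", "西", "藏"]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(decode_bio(tokens, labels))  # ['韩红', '西藏']
```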
S2, searching in a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
As known to those skilled in the art, a knowledge graph is composed of entities (nodes) and entity relationships (edges), where the entities carry descriptive information such as names and attributes. Entity relationships also have names and attributes, and are directed.
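For illustration only, the following sketch shows one possible in-memory representation of such a node, with its name, descriptive attributes, and first- and second-degree related nodes; the field names and example values are assumptions, not a schema required by the invention.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GraphNode:
    """A knowledge graph node: a name, descriptive attributes, and its related nodes by degree."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    first_degree: List["GraphNode"] = field(default_factory=list)   # directly related nodes (one hop)
    second_degree: List["GraphNode"] = field(default_factory=list)  # related nodes two hops away

# Hypothetical candidate entity node for the mention "韩红"
candidate = GraphNode(
    name="韩红",
    attributes={"职业": "歌手"},
    first_degree=[GraphNode(name="慈善基金会", attributes={"类型": "机构"})],
)
```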
S3, calculating the similarity between first feature information of the text and second feature information of each candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
Specifically, the first feature information may be the sum of the word vectors of the feature words of the text in which the entity is located, as described below.
The text is segmented into words, and the term frequency of each word (the number of occurrences of the word divided by the total number of words in the document) is calculated; the words are ranked by term frequency from high to low, and the top n words are taken as the feature words of the text. The word vectors of the n feature words are then summed:

textVec = V_1 + V_2 + ... + V_n

where V_i denotes the word vector of the i-th feature word, and textVec denotes the summary vector of the text to be processed, i.e. the first feature information. The word vectors may be obtained from a Chinese model pre-trained on encyclopedia data with FastText (a fast text classification algorithm); each word vector has 300 dimensions, the same below.
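A minimal sketch of this first-feature computation, assuming jieba for Chinese word segmentation and a 300-dimensional pre-trained fastText model on disk; the model path and the value of n are illustrative assumptions.

```python
from collections import Counter
import numpy as np
import fasttext  # pip install fasttext
import jieba     # pip install jieba

model = fasttext.load_model("cc.zh.300.bin")  # assumed pre-trained 300-dim Chinese model

def text_vector(text: str, n: int = 10) -> np.ndarray:
    """textVec: sum of the word vectors of the top-n words ranked by term frequency."""
    words = [w for w in jieba.lcut(text) if w.strip()]
    total = len(words)
    # term frequency = occurrences of the word / total number of words in the document
    ranked = sorted(Counter(words).items(), key=lambda kv: kv[1] / total, reverse=True)
    feature_words = [w for w, _ in ranked[:n]]
    return np.sum([model.get_word_vector(w) for w in feature_words], axis=0)
```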
The associated nodes are nodes that have an association relationship with the candidate entity in the knowledge graph, such as first-degree relationship nodes and second-degree relationship nodes. The second feature information may be the sum of the word vectors of a node's name and attributes. The similarities between the first feature information of the text and the second feature information of the candidate entity node, of its first-degree relationship nodes, and of its second-degree relationship nodes are calculated separately, then weighted and summed to obtain the total similarity of the candidate entity. The specific steps are as follows:
1) Calculating similarity at the sentence level. The knowledge base document is split into sentences by periods. The sentence in which the entity is located is segmented into words, and the word vectors of those words are summed to form senVec; the word vectors of the candidate entity's node name and attributes are summed to form attrVec. The cosine similarity of senVec and attrVec is then calculated:

senScore = (senVec · attrVec) / (||senVec|| × ||attrVec||)

where ||x|| denotes the norm of the vector x; the result is the score senScore.
2) The word vectors of the candidate entity node's name and attributes and of the names and attributes of its first-degree relationship nodes are obtained and summed to form firstRelVec, and the similarity between firstRelVec and the text summary vector textVec is calculated to obtain the score firstRelScore.
3) The word vectors of the candidate entity node's name and attributes and of the names and attributes of its second-degree relationship nodes are obtained and summed to form secondRelVec, and the similarity between secondRelVec and the text summary vector textVec is calculated to obtain the score secondRelScore.
4) Different, configurable weights are assigned to the scores obtained for each candidate node retrieved for the entity, and the weighted scores are summed to obtain the total similarity of that candidate (a minimal sketch of this computation follows).
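The sketch below illustrates steps 1)–4) under the assumptions already stated: cosine similarity between the summed vectors, followed by the configurable weighted sum; the weight values are illustrative only.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: (a · b) / (||a|| * ||b||)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def total_similarity(sen_vec, attr_vec, first_rel_vec, second_rel_vec, text_vec,
                     weights=(0.2, 0.7, 0.1)):  # configurable weights (illustrative values)
    sen_score = cosine(sen_vec, attr_vec)                # 1) sentence-level similarity
    first_rel_score = cosine(text_vec, first_rel_vec)    # 2) first-degree relationship similarity
    second_rel_score = cosine(text_vec, second_rel_vec)  # 3) second-degree relationship similarity
    w_sen, w_first, w_second = weights                   # 4) weighted sum of the three scores
    return w_sen * sen_score + w_first * first_rel_score + w_second * second_rel_score
```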
S4, associating the entity with the candidate entity whose total similarity is the maximum and exceeds the threshold value.
Specifically, if more than one candidate entity is retrieved from the knowledge graph library in step S2, feature matching and semantic calculation are performed according to step S3 and the maximum total similarity is determined, so as to find the best matching candidate entity. It is then judged whether the maximum total similarity reaches the association threshold; if so, the association is made and the entity ID of the candidate entity, namely doc_id, is returned; if not, no association is made.
If only one matching candidate entity is retrieved, the total similarity is calculated directly through step S3 and compared with the association threshold; if the threshold is reached, the association is made and doc_id is returned; otherwise, no association is made.
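A minimal sketch of this decision step, assuming each candidate has already been scored by the total-similarity computation above; the threshold value and the data layout are illustrative assumptions.

```python
def associate(candidates, threshold: float = 0.6):
    """candidates: list of (doc_id, total_similarity) pairs. Return the doc_id to associate, or None."""
    if not candidates:
        return None
    best_doc_id, best_score = max(candidates, key=lambda c: c[1])
    return best_doc_id if best_score > threshold else None  # associate only when above the threshold

# Usage with hypothetical scores
print(associate([("doc_001", 0.74), ("doc_002", 0.41)]))  # -> doc_001
```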
The method for associating knowledge base documents with knowledge graph entities provided by the embodiment of the invention can extract effective features, making full use of the entity in the text, the sentence in which the entity is located, the text summary, and the degree of correlation with the entity in the graph, its attributes, its first-degree relationships and related entities, and its second-degree relationships and related entities, thereby effectively improving the accuracy and recall of entity association.
Existing entity association methods have another problem: after an entity in a document has been associated with the knowledge graph, the position of the associated entity in the document still needs to be obtained, and it cannot be located directly, especially when the document has a large number of pages. To address this issue, optionally, in this embodiment, the method further includes:
S5, querying the position of each entity in the entity list within the knowledge base document to obtain a position list corresponding to the entity.
Specifically, the position of the entity in the knowledge base document may be the page number on which the entity appears. In this embodiment, an Elasticsearch engine may be used to query the page numbers of the entity in the knowledge base document, obtaining a page number list of all the pages on which the entity appears. Thus, when the doc_id of the entity is returned, the page number list corresponding to the doc_id can also be returned.
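An illustrative sketch of this lookup, assuming an elasticsearch-py 7.x client and an index whose documents store a page number and the page text under the hypothetical fields "page" and "content"; the index name and mapping are assumptions.

```python
from elasticsearch import Elasticsearch  # pip install "elasticsearch<8"

es = Elasticsearch("http://localhost:9200")  # assumed local Elasticsearch instance

def entity_pages(entity: str, index: str = "kb_pages", size: int = 1000):
    """Return the sorted list of page numbers on which the entity appears."""
    resp = es.search(
        index=index,
        body={"query": {"match_phrase": {"content": entity}}, "_source": ["page"], "size": size},
    )
    return sorted({hit["_source"]["page"] for hit in resp["hits"]["hits"]})
```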
To further enable the user to quickly view the association information of the entity in real time, optionally, in this embodiment, the method further includes:
and S6, emphasizing the format of the entity of the document of the knowledge base in the position of the position list.
Specifically, according to the page list corresponding to doc_id, emphasis processing such as bolding and highlighting can be applied to the entity in the page content of the document, so that the document entity corresponding to the entity link can be found quickly and conveniently.
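One possible realization of the emphasis processing (illustrative only) is to wrap each occurrence of the entity on the listed pages in HTML bold and highlight markup before the page is rendered:

```python
import html
import re

def emphasize(page_text: str, entity: str) -> str:
    """Wrap every occurrence of the entity in bold + highlight markup (one possible emphasis format)."""
    escaped = html.escape(page_text)
    pattern = re.escape(html.escape(entity))
    return re.sub(pattern, lambda m: f'<b style="background:yellow">{m.group(0)}</b>', escaped)

def emphasize_pages(pages: dict, page_list: list, entity: str) -> dict:
    """pages: {page_number: page_text}; page_list: pages returned by the Elasticsearch query above."""
    return {p: emphasize(t, entity) if p in page_list else t for p, t in pages.items()}
```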
The following example illustrates the principle of the present invention, taking the processing of the following text as an example:
"the famous singer hanyaoming appears together with octogen on the tibetan public welfare event release party initiated by himself. It is known that in the beginning of the next month, Hanhong, as many as hundreds of love people and medical experts form a love fleet of rescue volunteers for 20 days of public service travel. "
The above text is first split by the period into two sentences:
Sentence 1: "The famous singer Han Hong appeared together with Yao Ming and other guests at the launch event of the Tibet aid public welfare campaign that she initiated."
Sentence 2: "It is reported that at the beginning of next month, Han Hong will join hundreds of caring volunteers and medical experts to form a charity convoy for a 20-day public welfare trip."
For the sentence in which the entity is located, the sentence is segmented into words, and the word vectors of all the words are obtained through FastText and summed to form the sentence vector senVec. The word vectors of the node names and attributes of the candidate entities retrieved for Han Hong, Yao Ming and the other recognized entities are obtained and summed to form attrVec, and the vector similarity of senVec and attrVec is then calculated:
senScore = (senVec · attrVec) / (||senVec|| × ||attrVec||), where ||x|| denotes the norm of the vector x; the result is the score senScore.
Then, the word vectors of the candidate entity node names and attributes and of the names and attributes of their first-degree relationship nodes are obtained and summed to form firstRelVec, and the similarity with the text summary vector textVec is calculated using the same formula as above to obtain the score firstRelScore.
Similarly, the word vectors of the candidate entity node names and attributes and of the names and attributes of their second-degree relationship nodes are obtained and summed to form secondRelVec, and the vector similarity with the text summary vector textVec is calculated using the same formula to obtain the score secondRelScore.
The calculated scores are given different, configurable weights. If the first-degree relationship score is considered more important, the weight of firstRelScore is set higher, for example 0.7, with the remaining weights senScore = 0.2 and secondRelScore = 0.1; each score is multiplied by its weight and the results are summed to obtain sum. The sum of each entity is then compared with the set threshold; if it is greater than the threshold, the association is made, and the page list of the associated entity is obtained from the pages of each entity returned by Elasticsearch.
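As a worked example with hypothetical scores senScore = 0.62, firstRelScore = 0.80 and secondRelScore = 0.55, the total would be sum = 0.2 × 0.62 + 0.7 × 0.80 + 0.1 × 0.55 = 0.739, which would be associated under a threshold of, say, 0.6.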
Fig. 2 is a block diagram of a system for associating knowledge base documents with knowledge graph entities according to an embodiment of the present invention. The functional principles of the modules in the system have been described in the foregoing method embodiment and are not repeated below.
As shown in fig. 2, the system includes:
the entity identification module is used for carrying out entity identification on the text to obtain an entity list;
the candidate entity searching module is used for searching in a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
the similarity calculation module is used for calculating the similarity between first feature information of the text and second feature information of each candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and the entity association module is used for associating the entity with the candidate entity whose total similarity is the maximum and exceeds a threshold value.
Optionally, in this embodiment, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Optionally, in this embodiment, the system further includes:
and the position query module is used for querying the position of each entity in the entity list within the knowledge base document to obtain a position list corresponding to the entity.
Optionally, in this embodiment, the system further includes:
and the format processing module is used for applying emphasis formatting to the entity at the positions in the position list within the knowledge base document.
FIG. 3 is a schematic diagram illustrating a computing device according to an exemplary embodiment of the present invention.
Referring to fig. 3, computing device 300 includes memory 310 and processor 320.
The processor 320 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 310 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions needed by the processor 320 or other modules of the computer. The persistent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage device. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory. The system memory may store instructions and data needed by some or all of the processors at runtime. Furthermore, the memory 310 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 310 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, or a Micro-SD card), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 310 has stored thereon executable code that, when processed by the processor 320, may cause the processor 320 to perform some or all of the methods described above.
The aspects of the invention have been described in detail hereinabove with reference to the drawings. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. Those skilled in the art should also appreciate that the acts and modules referred to in the specification are not necessarily required by the invention. In addition, it can be understood that the steps in the method according to the embodiment of the present invention may be sequentially adjusted, combined, and deleted according to actual needs, and the modules in the device according to the embodiment of the present invention may be combined, divided, and deleted according to actual needs.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out some or all of the steps of the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform part or all of the various steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.