Disclosure of Invention
To address the above technical problems, the present invention provides a method and a system for associating knowledge base documents with knowledge graph entities, which can improve the accuracy and recall of entity association.
The technical solution adopted by the present invention to solve the above technical problems is as follows:
In a first aspect, the present invention provides a method for associating knowledge base documents with knowledge graph entities, comprising:
performing entity identification on the text to obtain an entity list;
searching in a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
calculating the similarity between first feature information of the text and second feature information of each candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and associating the entity with the candidate entity whose total similarity is the maximum and exceeds a threshold value.
The invention has the beneficial effects that:
the similarity is calculated by making full use of the feature information of the text together with the feature information of the candidate entities and of the associated nodes retrieved for the entities in the text, which effectively improves the accuracy and recall of entity association.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Further, the position of each entity in the entity list within the knowledge base document is queried to obtain a position list corresponding to the entity.
Further, emphasis formatting is applied to the entity at the positions in the position list within the knowledge base document.
In a second aspect, the present invention further provides a system for associating knowledge base documents with knowledge graph entities, comprising:
the entity identification module is used for carrying out entity identification on the text to obtain an entity list;
the candidate entity searching module is used for searching in a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
the similarity calculation module is used for calculating the similarity between first feature information of the text and second feature information of each candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and the entity association module is used for associating the entity with the candidate entity whose total similarity is the maximum and exceeds a threshold value.
Further, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Further, the system further includes:
and the position query module is used for querying the position of each entity in the entity list within the knowledge base document to obtain a position list corresponding to the entity.
Further, the system further includes:
and the format processing module is used for applying emphasis formatting to the entity at the positions in the position list within the knowledge base document.
In a third aspect, the present invention further provides an electronic device, comprising a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate with each other through the bus, and the processor executes the machine-readable instructions to perform the steps of the above method.
In a fourth aspect, the present invention further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the above method.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of a method for associating knowledge base documents with knowledge graph entities according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
S1, performing entity recognition on the text to obtain an entity list;
Specifically, the text is a passage from a knowledge base document. A CRF (conditional random field) entity recognition model is used to perform entity recognition on the knowledge base document, recognizing entities such as person names and object names to obtain an entity list for the text.
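As an illustrative sketch only (the patent does not prescribe a particular implementation), the following Python snippet shows how a BIO label sequence produced by a CRF tagger might be decoded into the entity list; the tokens, labels and tag names are hypothetical.

```python
# Minimal sketch: decode a BIO label sequence (e.g. produced by a CRF tagger) into an entity list.
# The tokens and labels below are hypothetical; in practice they come from the trained CRF model.
def decode_bio(tokens, labels):
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):                # beginning of a new entity
            if current:
                entities.append("".join(current))
            current = [token]
        elif label.startswith("I-") and current:  # continuation of the current entity
            current.append(token)
        else:                                     # "O" (or a stray "I-") closes the current entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

# Hypothetical tagger output over the characters of one sentence
tokens = ["韩", "红", "将", "赴", "西", "藏"]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(decode_bio(tokens, labels))  # ['韩红', '西藏']
```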
S2, searching in a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
As known to those skilled in the art, a knowledge graph is composed of entities (nodes) and entity relationships (edges), where the entities carry descriptive information such as names and attributes. Entity relationships also have names and attributes, and are directed.
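For illustration only, the following sketch shows one possible in-memory representation of such a node, with its name, descriptive attributes, and first- and second-degree related nodes; the field names and example values are assumptions, not a schema required by the invention.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class GraphNode:
    """A knowledge graph node: a name, descriptive attributes, and its related nodes by degree."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    first_degree: List["GraphNode"] = field(default_factory=list)   # directly related nodes (one hop)
    second_degree: List["GraphNode"] = field(default_factory=list)  # related nodes two hops away

# Hypothetical candidate entity node for the mention "韩红"
candidate = GraphNode(
    name="韩红",
    attributes={"职业": "歌手"},
    first_degree=[GraphNode(name="慈善基金会", attributes={"类型": "机构"})],
)
```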
S3, calculating the similarity between first feature information of the text and second feature information of each candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
Specifically, the first feature information may be the sum of the word vectors of the feature words of the text in which the entity is located, as described below.
The text is segmented into words, and the term frequency of each word (the number of occurrences of the word divided by the total number of words in the document) is calculated; the words are ranked by term frequency from high to low, and the top n words are taken as the feature words of the text. The word vectors of the n feature words are then summed:

textVec = V_1 + V_2 + ... + V_n

where V_i denotes the word vector of the i-th feature word, and textVec denotes the summary vector of the text to be processed, i.e. the first feature information. The word vectors may be obtained from a Chinese model pre-trained on encyclopedia data with FastText (a fast text classification algorithm); each word vector has 300 dimensions, the same below.
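A minimal sketch of this first-feature computation, assuming jieba for Chinese word segmentation and a 300-dimensional pre-trained fastText model on disk; the model path and the value of n are illustrative assumptions.

```python
from collections import Counter
import numpy as np
import fasttext  # pip install fasttext
import jieba     # pip install jieba

model = fasttext.load_model("cc.zh.300.bin")  # assumed pre-trained 300-dim Chinese model

def text_vector(text: str, n: int = 10) -> np.ndarray:
    """textVec: sum of the word vectors of the top-n words ranked by term frequency."""
    words = [w for w in jieba.lcut(text) if w.strip()]
    total = len(words)
    # term frequency = occurrences of the word / total number of words in the document
    ranked = sorted(Counter(words).items(), key=lambda kv: kv[1] / total, reverse=True)
    feature_words = [w for w, _ in ranked[:n]]
    return np.sum([model.get_word_vector(w) for w in feature_words], axis=0)
```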
The associated nodes are nodes that have an association relationship with the candidate entity in the knowledge graph, such as first-degree relationship nodes and second-degree relationship nodes. The second feature information may be the sum of the word vectors of a node's name and attributes. The similarities between the first feature information of the text and the second feature information of the candidate entity node, of its first-degree relationship nodes, and of its second-degree relationship nodes are calculated separately, then weighted and summed to obtain the total similarity of the candidate entity. The specific steps are as follows:
1) Calculating similarity at the sentence level. The knowledge base document is split into sentences by periods. The sentence in which the entity is located is segmented into words, and the word vectors of those words are summed to form senVec; the word vectors of the candidate entity's node name and attributes are summed to form attrVec. The cosine similarity of senVec and attrVec is then calculated:

senScore = (senVec · attrVec) / (||senVec|| × ||attrVec||)

where ||x|| denotes the norm of the vector x; the result is the score senScore.
2) The word vectors of the candidate entity node's name and attributes and of the names and attributes of its first-degree relationship nodes are obtained and summed to form firstRelVec, and the similarity between firstRelVec and the text summary vector textVec is calculated to obtain the score firstRelScore.
3) The word vectors of the candidate entity node's name and attributes and of the names and attributes of its second-degree relationship nodes are obtained and summed to form secondRelVec, and the similarity between secondRelVec and the text summary vector textVec is calculated to obtain the score secondRelScore.
4) Different, configurable weights are assigned to the scores obtained for each candidate node retrieved for the entity, and the weighted scores are summed to obtain the total similarity of that candidate (a minimal sketch of this computation follows).
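The sketch below illustrates steps 1)–4) under the assumptions already stated: cosine similarity between the summed vectors, followed by the configurable weighted sum; the weight values are illustrative only.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: (a · b) / (||a|| * ||b||)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def total_similarity(sen_vec, attr_vec, first_rel_vec, second_rel_vec, text_vec,
                     weights=(0.2, 0.7, 0.1)):  # configurable weights (illustrative values)
    sen_score = cosine(sen_vec, attr_vec)                # 1) sentence-level similarity
    first_rel_score = cosine(text_vec, first_rel_vec)    # 2) first-degree relationship similarity
    second_rel_score = cosine(text_vec, second_rel_vec)  # 3) second-degree relationship similarity
    w_sen, w_first, w_second = weights                   # 4) weighted sum of the three scores
    return w_sen * sen_score + w_first * first_rel_score + w_second * second_rel_score
```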
S4, associating the entity with the candidate entity whose total similarity is the maximum and exceeds the threshold value.
Specifically, if more than one candidate entity is retrieved from the knowledge graph library in step S2, feature matching and semantic calculation are performed according to step S3 and the maximum total similarity is determined, so as to find the best matching candidate entity. It is then judged whether the maximum total similarity reaches the association threshold; if so, the association is made and the entity ID of the candidate entity, namely doc_id, is returned; if not, no association is made.
If only one matching candidate entity is retrieved, the total similarity is calculated directly through step S3 and compared with the association threshold; if the threshold is reached, the association is made and doc_id is returned; otherwise, no association is made.
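A minimal sketch of this decision step, assuming each candidate has already been scored by the total-similarity computation above; the threshold value and the data layout are illustrative assumptions.

```python
def associate(candidates, threshold: float = 0.6):
    """candidates: list of (doc_id, total_similarity) pairs. Return the doc_id to associate, or None."""
    if not candidates:
        return None
    best_doc_id, best_score = max(candidates, key=lambda c: c[1])
    return best_doc_id if best_score > threshold else None  # associate only when above the threshold

# Usage with hypothetical scores
print(associate([("doc_001", 0.74), ("doc_002", 0.41)]))  # -> doc_001
```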
The method for associating knowledge base documents with knowledge graph entities provided by the embodiment of the invention can extract effective features, making full use of the entity in the text, the sentence in which the entity is located, the text summary, and the degree of correlation with the entity in the graph, its attributes, its first-degree relationships and related entities, and its second-degree relationships and related entities, thereby effectively improving the accuracy and recall of entity association.
Existing entity association methods have another problem: after an entity in a document has been associated with the knowledge graph, the position of the associated entity in the document still needs to be obtained, and it cannot be located directly, especially when the document has a large number of pages. To address this issue, optionally, in this embodiment, the method further includes:
S5, querying the position of each entity in the entity list within the knowledge base document to obtain a position list corresponding to the entity.
Specifically, the position of the entity in the knowledge base document may be the page number on which the entity appears. In this embodiment, an Elasticsearch engine may be used to query the page numbers of the entity in the knowledge base document, obtaining a page number list of all the pages on which the entity appears. Thus, when the doc_id of the entity is returned, the page number list corresponding to the doc_id can also be returned.
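An illustrative sketch of this lookup, assuming an elasticsearch-py 7.x client and an index whose documents store a page number and the page text under the hypothetical fields "page" and "content"; the index name and mapping are assumptions.

```python
from elasticsearch import Elasticsearch  # pip install "elasticsearch<8"

es = Elasticsearch("http://localhost:9200")  # assumed local Elasticsearch instance

def entity_pages(entity: str, index: str = "kb_pages", size: int = 1000):
    """Return the sorted list of page numbers on which the entity appears."""
    resp = es.search(
        index=index,
        body={"query": {"match_phrase": {"content": entity}}, "_source": ["page"], "size": size},
    )
    return sorted({hit["_source"]["page"] for hit in resp["hits"]["hits"]})
```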
To further enable the user to quickly view the association information of the entity in real time, optionally, in this embodiment, the method further includes:
and S6, emphasizing the format of the entity of the document of the knowledge base in the position of the position list.
Specifically, according to the page list corresponding to doc_id, emphasis processing such as bolding and highlighting can be applied to the entity in the page content of the document, so that the document entity corresponding to the entity link can be found quickly and conveniently.
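One possible realization of the emphasis processing (illustrative only) is to wrap each occurrence of the entity on the listed pages in HTML bold and highlight markup before the page is rendered:

```python
import html
import re

def emphasize(page_text: str, entity: str) -> str:
    """Wrap every occurrence of the entity in bold + highlight markup (one possible emphasis format)."""
    escaped = html.escape(page_text)
    pattern = re.escape(html.escape(entity))
    return re.sub(pattern, lambda m: f'<b style="background:yellow">{m.group(0)}</b>', escaped)

def emphasize_pages(pages: dict, page_list: list, entity: str) -> dict:
    """pages: {page_number: page_text}; page_list: pages returned by the Elasticsearch query above."""
    return {p: emphasize(t, entity) if p in page_list else t for p, t in pages.items()}
```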
The following example illustrates the principle of the present invention, taking the processing of the following text as an example:
"the famous singer hanyaoming appears together with octogen on the tibetan public welfare event release party initiated by himself. It is known that in the beginning of the next month, Hanhong, as many as hundreds of love people and medical experts form a love fleet of rescue volunteers for 20 days of public service travel. "
The above text is first split by the period into two sentences:
Sentence 1: "The famous singer Han Hong appeared together with Yao Ming and other guests at the launch event of the Tibet aid public welfare campaign that she initiated."
Sentence 2: "It is reported that at the beginning of next month, Han Hong will join hundreds of caring volunteers and medical experts to form a charity convoy for a 20-day public welfare trip."
For the sentence in which the entity is located, the sentence is segmented into words, and the word vectors of all the words are obtained through FastText and summed to form the sentence vector senVec. The word vectors of the node names and attributes of the candidate entities retrieved for Han Hong, Yao Ming and the other recognized entities are obtained and summed to form attrVec, and the vector similarity of senVec and attrVec is then calculated:
senScore = (senVec · attrVec) / (||senVec|| × ||attrVec||), where ||x|| denotes the norm of the vector x; the result is the score senScore.
Then, the word vectors of the candidate entity node names and attributes and of the names and attributes of their first-degree relationship nodes are obtained and summed to form firstRelVec, and the similarity with the text summary vector textVec is calculated using the same formula as above to obtain the score firstRelScore.
Similarly, the word vectors of the candidate entity node names and attributes and of the names and attributes of their second-degree relationship nodes are obtained and summed to form secondRelVec, and the vector similarity with the text summary vector textVec is calculated using the same formula to obtain the score secondRelScore.
The calculated scores are given different, configurable weights. If the first-degree relationship score is considered more important, the weight of firstRelScore is set higher, for example 0.7, with the remaining weights senScore = 0.2 and secondRelScore = 0.1; each score is multiplied by its weight and the results are summed to obtain sum. The sum of each entity is then compared with the set threshold; if it is greater than the threshold, the association is made, and the page list of the associated entity is obtained from the pages of each entity returned by Elasticsearch.
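As a worked example with hypothetical scores senScore = 0.62, firstRelScore = 0.80 and secondRelScore = 0.55, the total would be sum = 0.2 × 0.62 + 0.7 × 0.80 + 0.1 × 0.55 = 0.739, which would be associated under a threshold of, say, 0.6.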
Fig. 2 is a block diagram of a system for associating knowledge base documents with knowledge graph entities according to an embodiment of the present invention. The functional principles of the modules in the system have been described in the foregoing method embodiment and are not repeated below.
As shown in fig. 2, the system includes:
the entity identification module is used for carrying out entity identification on the text to obtain an entity list;
the candidate entity searching module is used for searching in a knowledge graph library according to the entities in the entity list to obtain at least one candidate entity;
the similarity calculation module is used for calculating the similarity between first feature information of the text and second feature information of each candidate entity and of at least one associated node of the candidate entity, and weighting the calculated similarities by their corresponding weights to obtain a total similarity for each candidate entity;
and the entity association module is used for associating the entity with the candidate entity whose total similarity is the maximum and exceeds a threshold value.
Optionally, in this embodiment, the first feature information is a sum of word vectors of feature words of the text, and the second feature information is a sum of word vectors of node names and attributes.
Optionally, in this embodiment, the system further includes:
and the position query module is used for querying the position of each entity in the entity list within the knowledge base document to obtain a position list corresponding to the entity.
Optionally, in this embodiment, the system further includes:
and the format processing module is used for applying emphasis formatting to the entity at the positions in the position list within the knowledge base document.
FIG. 3 is a schematic diagram illustrating a computing device according to an exemplary embodiment of the present invention.
Referring to fig. 3, computing device 300 includes memory 310 and processor 320.
The processor 320 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 310 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions needed by the processor 320 or other modules of the computer. The persistent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage device. In other embodiments, the persistent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory. The system memory may store instructions and data needed by some or all of the processors at runtime. Furthermore, the memory 310 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 310 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, or a Micro-SD card), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 310 has stored thereon executable code that, when processed by the processor 320, may cause the processor 320 to perform some or all of the methods described above.
The aspects of the invention have been described in detail hereinabove with reference to the drawings. In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. Those skilled in the art should also appreciate that the acts and modules referred to in the specification are not necessarily required by the invention. In addition, it can be understood that the steps in the method according to the embodiment of the present invention may be sequentially adjusted, combined, and deleted according to actual needs, and the modules in the device according to the embodiment of the present invention may be combined, divided, and deleted according to actual needs.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out some or all of the steps of the above-described method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform part or all of the various steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.