Disclosure of Invention
The invention provides a knowledge graph optimization method and apparatus based on document attribute assignment entity weight, and a computer-readable storage medium, and mainly aims to solve the problems that the data of a knowledge graph is not updated in a timely manner and lags seriously.
In order to achieve the above object, the invention provides a knowledge graph optimization method based on document attribute assignment entity weight, which comprises the following steps:
obtaining an original document, and carrying out document field classification on the original document according to word characteristics in the original document to obtain a field text;
identifying an original entity library and an original relation library corresponding to the field text, and extracting text sentences from the field text according to original entity entries and original relation entries in the original entity library and the original relation library;
performing word segmentation processing on the text sentences to obtain a transaction entry set, and extracting a target entry set from the transaction entry set according to the support degree and confidence degree of each entry in the transaction entry set;
extracting sentences containing entries in the target entry set from the field text to obtain a target sentence set;
extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
judging whether the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than a preset relation similarity threshold;
if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than the relation similarity threshold, storing the candidate triple into the original relation library and the original entity library to obtain an initial relation library and an initial entity library;
if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is not greater than the relation similarity threshold, judging whether the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than a preset entity similarity threshold;
if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than the entity similarity threshold, storing the candidate triple into the original relation library and the original entity library to obtain the initial relation library and the initial entity library;
if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is not greater than the entity similarity threshold, returning to the step of extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
and calculating entity weight values of the entity entries, and optimizing the initial relation library and the initial entity library according to the entity weight values to obtain a target relation library and a target entity library.
Optionally, the obtaining of the original document includes:
acquiring a pre-constructed database table, and extracting structured data from the database table;
and crawling a document in a pre-constructed encyclopedia webpage, and integrating the structured data and the crawled document to obtain the original document.
Optionally, the performing document field classification on the original document according to the word features in the original document to obtain a field text includes:
extracting document keywords in the original document, and calculating attribute weight values of the document keywords;
calculating a vocabulary vector formula of the original document according to the attribute weight value;
and carrying out field classification on the original document by utilizing a pre-constructed clustering algorithm according to the vocabulary vector formula to obtain the field text.
Optionally, the extracting a target entry set in the transaction entry set according to the support degree and the confidence degree of each entry in the transaction entry set includes:
calculating the support degree of each entry in the transaction entry set by using a pre-constructed support degree calculation formula, and taking the entry with the support degree higher than a preset support threshold value as a frequent entry to obtain a frequent entry set;
calculating the confidence degrees of any two frequent term pairs in the frequent term set by using a pre-constructed confidence degree calculation formula, and extracting the frequent term pairs with the confidence degrees larger than a preset confidence threshold value from the frequent term set to obtain target frequent term pairs;
and integrating the terms in the target frequent term pairs to obtain the target entry set.
Optionally, the extracting, according to the verb of the sentence in the target sentence set, the candidate triple of each sentence in the target sentence set includes:
performing reference resolution and sentence simplification on each sentence in the target sentence set to obtain a standard sentence set;
identifying a verb of each sentence in the standard sentence set, and extracting a preposed noun and a postpositive noun of the verb in the standard sentence set according to the verb;
and integrating the verb, preposed noun and postpositive noun of each sentence in the standard sentence set to obtain the candidate triple.
Optionally, the calculating an entity weight value of the entity entry includes:
calculating the document weight of the original document according to the document grade and the reference times of the original document;
calculating the candidate weight of the candidate triple according to the occurrence frequency of the candidate triple in the database table and the encyclopedic webpage;
and calculating the entity weight value of the entity entry by utilizing a pre-constructed weight formula according to the document weight and the candidate weight.
Optionally, the optimizing the initial relation library and the initial entity library according to the entity weight value to obtain a target relation library and a target entity library includes:
identifying an approximate entry in the initial entity library that is most similar to the entity entry;
judging whether the entity weight value of the approximate entry is greater than the entity weight value of the entity entry;
if the entity weight value of the approximate entry is greater than the entity weight value of the entity entry, the approximate entry is not replaced;
if the entity weight value of the approximate entry is not greater than the entity weight value of the entity entry, replacing the approximate entry with the entity entry to obtain the target entity library;
and correcting the relation entries in the initial relation library by using the candidate triple according to the entity entries to obtain the target relation library.
In order to solve the above problem, the present invention further provides a knowledge graph optimization apparatus for assigning entity weights based on document attributes, the apparatus comprising:
an original document classification module, which is used for acquiring an original document, and performing document domain classification on the original document according to word features in the original document to obtain a domain text;
a target sentence set extraction module, which is used for identifying an original entity library and an original relation library corresponding to the domain text, and extracting text sentences from the domain text according to original entity entries and original relation entries in the original entity library and the original relation library; performing word segmentation processing on the text sentences to obtain a transaction entry set, and extracting a target entry set from the transaction entry set according to the support degree and confidence degree of each entry in the transaction entry set; and extracting sentences containing entries in the target entry set from the domain text to obtain a target sentence set;
a candidate triple construction module, which is used for extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
an initial relation library and initial entity library construction module, configured to determine whether the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than a preset relation similarity threshold; if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than the relation similarity threshold, store the candidate triple into the original relation library and the original entity library to obtain an initial relation library and an initial entity library; if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is not greater than the relation similarity threshold, determine whether the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than a preset entity similarity threshold; if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than the entity similarity threshold, store the candidate triple into the original relation library and the original entity library to obtain the initial relation library and the initial entity library; and if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is not greater than the entity similarity threshold, return to the step of extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
and the initial relation library and initial entity library optimization module is used for calculating entity weight values of the entity entries, and optimizing the initial relation library and the initial entity library according to the entity weight values to obtain a target relation library and a target entity library.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to implement the above knowledge graph optimization method based on document attribute assignment entity weight.
In order to solve the above problem, the present invention further provides a computer-readable storage medium storing at least one instruction, where the at least one instruction is executed by a processor in an electronic device to implement the above knowledge graph optimization method based on document attribute assignment entity weight.
Compared with the background art: the embodiment of the invention obtains a domain text by performing document domain classification on the original document; extracts text sentences from the domain text according to the original entity entries and original relation entries in the original entity library and the original relation library, so as to extract a target sentence set from the text sentences; extracts a candidate triple of each sentence in the target sentence set according to the verb of the sentence; judges, through the relation entry and the entity entry in the candidate triple, whether the candidate triple can be included in the original relation library and the original entity library, and obtains the initial relation library and the initial entity library when it can be included; and optimizes the initial relation library and the initial entity library according to the entity weight values to obtain a target relation library and a target entity library. Therefore, the method, the apparatus, the electronic device and the computer-readable storage medium for optimizing a knowledge graph based on document attribute assignment entity weight can solve the problems that the data of the knowledge graph is not updated in a timely manner and lags seriously.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a knowledge graph optimization method based on document attribute assignment entity weight. The execution subject of the method includes, but is not limited to, at least one of electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the method may be executed by software or hardware installed in a terminal device or a server device. The server includes, but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Example 1:
referring to fig. 1, a flowchart of a method for optimizing a knowledge graph based on document attribute assignment entity weight according to an embodiment of the present invention is shown. In this embodiment, the method for optimizing a knowledge graph based on document attribute assignment entity weight includes:
s1, obtaining an original document, and carrying out document field classification on the original document according to the word characteristics in the original document to obtain a field text.
It can be understood that the original document may be a document from any of various industry domains, for example: finance, medicine, securities investment, and education. The word characteristics may be the industry attributes of the words, for example: financial industry attributes, medical industry attributes, and the like.
In an embodiment of the present invention, the obtaining an original document includes:
acquiring a pre-constructed database table, and extracting structured data from the database table;
and crawling a document in a pre-constructed encyclopedia webpage, and integrating the structured data and the crawled document to obtain the original document.
It can be understood that the original document can be obtained through two channels: one is the enterprise's own business data, which is usually contained in database tables inside the company and stored in a structured form; the other is public data captured from the web, which is usually in the form of web pages and is therefore unstructured.
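As an illustrative sketch only (the table name, column names and page URLs below are hypothetical, and the patent does not prescribe a specific toolchain), the two acquisition channels could be combined roughly as follows:

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a crawled encyclopedia page."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def collect_original_documents(db_path, page_urls):
    # Channel 1: structured business data from a pre-constructed database table
    # (the table and column names are illustrative assumptions).
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT title, content FROM business_records").fetchall()
    structured_docs = [f"{title}. {content}" for title, content in rows]
    conn.close()

    # Channel 2: unstructured documents crawled from encyclopedia web pages.
    crawled_docs = []
    for url in page_urls:
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
        parser = TextExtractor()
        parser.feed(html)
        crawled_docs.append(" ".join(parser.parts))

    # Integrate both channels into one collection of original documents.
    return structured_docs + crawled_docs
```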
In detail, referring to fig. 2, the classifying the original document according to the word feature in the original document to obtain a domain text includes:
s11, extracting the document keywords in the original document, and calculating the attribute weight values of the document keywords;
s12, calculating a vocabulary vector formula of the original document according to the attribute weight value;
and S13, performing field classification on the original document according to the vocabulary vector formula by using a pre-constructed clustering algorithm to obtain the field text.
It can be understood that the document keywords are extracted from the original document by using artificial intelligence techniques, the attribute weight values of the document keywords are then calculated respectively, and the vocabulary vector of the original document is obtained according to the order of magnitude of the attribute weight values of the words; the vocabulary vectors of the original documents are then clustered and summarized by a clustering algorithm in combination with industry characteristics, each original document is assigned to a certain domain according to the clustering result, and the original document is stored in the text library of that domain.
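A minimal sketch of this step, assuming TF-IDF scores stand in for the attribute weight values and k-means is used as the pre-constructed clustering algorithm (the patent fixes neither choice), might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def classify_documents_by_domain(original_documents, num_domains):
    # Attribute weight values of the document keywords, approximated by TF-IDF;
    # each row of the matrix is the vocabulary vector of one original document.
    vectorizer = TfidfVectorizer(max_features=5000)
    vocabulary_vectors = vectorizer.fit_transform(original_documents)

    # Cluster the vocabulary vectors so that documents with similar keyword
    # weights are grouped into the same domain.
    clustering = KMeans(n_clusters=num_domains, n_init=10, random_state=0)
    domain_labels = clustering.fit_predict(vocabulary_vectors)

    # Store each original document into the text library of its domain.
    domain_texts = {}
    for label, document in zip(domain_labels, original_documents):
        domain_texts.setdefault(label, []).append(document)
    return domain_texts
```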
And S2, identifying an original entity library and an original relation library corresponding to the field text, and extracting text sentences from the field text according to original entity terms and original relation terms in the original entity library and the original relation library.
In the embodiment of the present invention, the text sentence refers to a sentence including the original entity entry or the original relation entry.
And S3, performing word segmentation processing on the text statement to obtain a transaction entry set, and extracting a target entry set in the transaction entry set according to the support degree and the confidence degree of each entry in the transaction entry set.
It can be understood that the support degree refers to how often an entry in the transaction entry set appears within the domain text, and the confidence degree refers to how frequently an entry in the transaction entry set appears in pairs with another entry within the domain text.
It should be understood that the target entry set refers to the set of entries in the transaction entry set whose support degree and confidence degree are greater than a preset support threshold and a preset confidence threshold, respectively.
In this embodiment of the present invention, the extracting a target entry set from the transaction entry set according to the support and the confidence of each entry in the transaction entry set includes:
calculating the support degree of each entry in the transaction entry set by using a pre-constructed support degree calculation formula, and taking the entries with a support degree higher than a preset support threshold value as frequent entries to obtain a frequent entry set, wherein the support degree calculation formula is as follows:

$$\mathrm{Sup}(w_i) = \frac{N(w_i)}{N}$$

wherein $\mathrm{Sup}(w_i)$ represents the support degree, $N(w_i)$ denotes the number of times the $i$-th entry appears within the domain text, and $N$ represents the total number of words within the domain text;
calculating the confidence degree of any two frequent terms in the frequent term set by using a pre-constructed confidence degree calculation formula, and extracting the frequent term pairs with a confidence degree greater than a preset confidence threshold value from the frequent term set to obtain target frequent term pairs, wherein the confidence degree calculation formula is as follows:

$$\mathrm{Conf}(w_i \rightarrow w_j) = \frac{N(w_i, w_j)}{N(w_i)}$$

wherein $\mathrm{Conf}(w_i \rightarrow w_j)$ represents the confidence degree, $w_i$ denotes the $i$-th frequent term within the domain text, $w_j$ denotes the other frequent term of the frequent term pair, $N(w_i, w_j)$ represents the number of times the $i$-th frequent term and the $j$-th frequent term occur together within the domain text, and $N(w_i)$ represents the number of times the frequent term $w_i$ occurs within the domain text;
and integrating the terms in the target frequent term pairs to obtain the target entry set.
It should be noted that support degree and confidence degree are existing concepts in association rule mining and are not described in detail herein.
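To make the two formulas concrete, the following sketch computes the support degree of each entry and the confidence degree of frequent term pairs as defined above; the whitespace tokenization and the threshold values are illustrative assumptions, not part of the patent:

```python
from itertools import combinations

def extract_target_entries(text_sentences, support_threshold, confidence_threshold):
    # Word segmentation: a simple whitespace split stands in for a real
    # segmenter; the resulting words form the transaction entry set.
    transactions = [sentence.split() for sentence in text_sentences]
    all_words = [word for transaction in transactions for word in transaction]
    total_words = len(all_words)

    # Support degree Sup(w) = N(w) / N; entries above the threshold are frequent.
    counts = {}
    for word in all_words:
        counts[word] = counts.get(word, 0) + 1
    frequent_terms = {w for w, n in counts.items() if n / total_words > support_threshold}

    # Confidence degree Conf(wi -> wj) = N(wi, wj) / N(wi), where co-occurrence
    # is counted within the same sentence.
    pair_counts = {}
    for transaction in transactions:
        present = frequent_terms.intersection(transaction)
        for wi, wj in combinations(sorted(present), 2):
            pair_counts[(wi, wj)] = pair_counts.get((wi, wj), 0) + 1

    target_entries = set()
    for (wi, wj), n_pair in pair_counts.items():
        if n_pair / counts[wi] > confidence_threshold:
            target_entries.update((wi, wj))
    return target_entries
```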
And S4, extracting sentences containing the entries in the target entry set from the field text to obtain a target sentence set.
And S5, extracting the candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set.
It can be understood that the candidate triple refers to the candidate entity words and relation word to be stored in the original entity library and the original relation library. A triple (Subject-Predicate-Object, SPO for short) expresses the correlation between associated entities in the form entity 1-relation-entity 2.
In detail, referring to fig. 2, the extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set includes:
s51, performing reference resolution and sentence simplification on each sentence in the target sentence set to obtain a standard sentence set;
s52, identifying a verb of each sentence in the standard sentence set, and extracting a preposed noun and a postpositive noun of the verb in the standard sentence set according to the verb;
and S53, integrating verbs, prepositive nouns and postpositive nouns of each sentence in the standard sentence set to obtain the candidate triple.
It can be understood that reference resolution refers to representing words that refer to the same entity in a unified way. Sentence simplification refers to deleting modifiers and the like from a sentence so as to simplify it. The preposed noun refers to the subject of the verb, and the postpositive noun refers to the object of the verb.
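The verb-centred extraction of S52-S53 can be sketched as below on sentences that have already been word-segmented and part-of-speech tagged; the tag set, the input format, and the nearest-noun heuristic are assumptions, and reference resolution and sentence simplification are presumed to have been applied already:

```python
def extract_candidate_triples(tagged_sentences):
    """Each sentence is a list of (word, pos) pairs, e.g. produced by a POS
    tagger after reference resolution and sentence simplification."""
    candidate_triples = []
    for sentence in tagged_sentences:
        for i, (word, pos) in enumerate(sentence):
            if pos != "VERB":
                continue
            # Preposed noun: the noun closest to the verb on its left (subject).
            preposed = next((w for w, p in reversed(sentence[:i]) if p == "NOUN"), None)
            # Postpositive noun: the noun closest to the verb on its right (object).
            postpositive = next((w for w, p in sentence[i + 1:] if p == "NOUN"), None)
            if preposed and postpositive:
                # Integrate as an entity 1-relation-entity 2 (SPO) candidate triple.
                candidate_triples.append((preposed, word, postpositive))
    return candidate_triples

# Example: one simplified, tagged sentence yields ("company", "acquires", "subsidiary").
print(extract_candidate_triples(
    [[("company", "NOUN"), ("acquires", "VERB"), ("subsidiary", "NOUN")]]
))
```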
And S6, judging whether the similarity between the relation vocabulary entry in the candidate triple and any original relation vocabulary entry in the original relation library is larger than a preset relation similarity threshold.
It can be understood that if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than the preset relation similarity threshold, this indicates that the relation entry in the candidate triple can be included in the original relation library. The relation similarity threshold may be determined from the similarities, to any original relation entry in the original relation library, of a large number of relation entries that do not belong to the original relation library and of a large number of relation entries that do belong to it. The similarity can be calculated according to factors such as the industry attribute, the part of speech and the relation object of an entry; the closer the industry attributes, parts of speech and relation objects of two entries are, the larger the similarity between the two entries.
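Since the patent only states that the similarity may be computed from factors such as the industry attribute, part of speech and relation object of an entry, the following is one possible scoring sketch under that assumption; the field names and the equal weighting of factors are illustrative, not prescribed:

```python
def entry_similarity(entry_a, entry_b):
    """Each entry is a dict with 'industry', 'pos' and 'relation_object' fields;
    the more of these factors agree, the larger the similarity score."""
    factors = ("industry", "pos", "relation_object")
    matches = sum(1 for f in factors if entry_a.get(f) == entry_b.get(f))
    return matches / len(factors)

# Two relation entries sharing industry attribute and part of speech score 2/3.
print(entry_similarity(
    {"industry": "finance", "pos": "verb", "relation_object": "company"},
    {"industry": "finance", "pos": "verb", "relation_object": "person"},
))
```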
If the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than the relation similarity threshold, S7 is executed, and the candidate triple is stored into the original relation library and the original entity library to obtain an initial relation library and an initial entity library.
In the embodiment of the present invention, the fact that the relation entry in the candidate triple can be included in the original relation library means that the entity entries in the candidate triple can also be included in the original entity library.
If the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is not greater than the relation similarity threshold, S8 is executed, and it is judged whether the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than a preset entity similarity threshold.
In the embodiment of the present invention, the entity similarity threshold and the similarity with the original entity entries are determined in the same manner as the relation similarity threshold and the similarity with the original relation entries, which is not repeated here.
If the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than the entity similarity threshold, S9 is executed, and the candidate triple is stored into the original relation library and the original entity library to obtain the initial relation library and the initial entity library.
If the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is not greater than the entity similarity threshold, the method returns to the step of extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set.
In the embodiment of the invention, verbs can be extracted cyclically from the target sentence set of the domain text until all suitable candidate triples are obtained.
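Putting S6 to S9 together, the branching can be sketched as follows, reusing the hypothetical entry_similarity helper from the earlier sketch; the thresholds and the list representation of the libraries are assumptions:

```python
def build_initial_libraries(candidate_triples, original_relation_library,
                            original_entity_library,
                            relation_threshold, entity_threshold):
    """Entries are dicts as in the entry_similarity sketch. A candidate triple
    (entity1, relation, entity2) is stored when its relation entry, or failing
    that one of its entity entries, is sufficiently similar to an existing
    library entry; otherwise it is skipped and the next candidate is examined
    (the loop back to the extraction step)."""
    for entity1, relation, entity2 in candidate_triples:
        relation_match = any(entry_similarity(relation, old) > relation_threshold
                             for old in original_relation_library)
        if not relation_match:
            entity_match = any(entry_similarity(new, old) > entity_threshold
                               for new in (entity1, entity2)
                               for old in original_entity_library)
            if not entity_match:
                continue  # neither check passes: move on to the next candidate
        original_relation_library.append(relation)
        original_entity_library.extend([entity1, entity2])
    # The enriched libraries serve as the initial relation and entity libraries.
    return original_relation_library, original_entity_library
```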
And S10, calculating entity weight values of the entity entries, and optimizing the initial relation library and the initial entity library according to the entity weight values to obtain a target relation library and a target entity library.
It can be understood that the entity weight value is a numerical value, calculated from various factors, that represents the degree of importance of the entity entry.
In an embodiment of the present invention, the calculating an entity weight value of the entity entry includes:
calculating the document weight of the original document according to the document grade and the reference times of the original document;
calculating the candidate weight of the candidate triple according to the occurrence frequency of the candidate triple in the database table and the encyclopedic webpage;
calculating the entity weight value of the entity entry by using a pre-constructed weight formula according to the document weight and the candidate weight, wherein the weight formula is as follows:

$$W = \alpha \cdot W_d + \beta \cdot W_c$$

wherein $W$ represents the entity weight value, $W_d$ represents the document weight, $\alpha$ represents the document weight coefficient, $W_c$ represents the candidate weight, and $\beta$ represents the candidate weight coefficient.
In the embodiment of the present invention, the document weight coefficient may be a coefficient greater than 0 and smaller than 1, and plays a role in adjusting the weight. The document weight and the candidate weight respectively refer to the importance degree of the original document and the candidate triple.
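A small numeric sketch of the weight formula follows; the coefficient values and the simple ways of deriving the document weight from the document grade and citation count, and the candidate weight from occurrence counts, are illustrative assumptions:

```python
def document_weight(document_grade, citation_count):
    # Document weight grows with the grade of the original document and the
    # number of times it is cited (an illustrative combination).
    return document_grade * (1 + citation_count)

def candidate_weight(db_occurrences, web_occurrences):
    # Candidate weight grows with how often the candidate triple appears in
    # the database table and in the encyclopedia web pages.
    return db_occurrences + web_occurrences

def entity_weight(doc_w, cand_w, alpha=0.6, beta=0.4):
    # W = alpha * W_d + beta * W_c, with 0 < alpha < 1 acting as an adjustment.
    return alpha * doc_w + beta * cand_w

# Example: a grade-2 document cited 3 times; candidate seen 2 + 3 times in total.
print(entity_weight(document_weight(2, 3), candidate_weight(2, 3)))
```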
In this embodiment of the present invention, the optimizing the initial relation library and the initial entity library according to the entity weight value to obtain a target relation library and a target entity library includes:
identifying an approximate entry in the initial entity library which is most similar to the entity entry;
judging whether the entity weight value of the approximate entry is greater than the entity weight value of the entity entry;
if the entity weight value of the approximate entry is greater than the entity weight value of the entity entry, the approximate entry is not replaced;
if the entity weight value of the approximate entry is not greater than the entity weight value of the entity entry, replacing the approximate entry with the entity entry to obtain the target entity library;
and correcting the relation entries in the initial relation library by using the candidate triple according to the entity entries to obtain the target relation library.
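An illustrative sketch of the replacement step above, using the same dict-based entry representation and the hypothetical entry_similarity helper as in the earlier sketches (the 'weight' field is an assumption):

```python
def optimize_entity_library(initial_entity_library, new_entries):
    """Each entry is a dict carrying at least a 'weight' field. A new entity
    entry replaces its most similar approximate entry only when its entity
    weight value is not smaller than that of the approximate entry."""
    for new_entry in new_entries:
        # Identify the approximate entry most similar to the new entity entry.
        approximate = max(initial_entity_library,
                          key=lambda old: entry_similarity(new_entry, old))
        if approximate["weight"] > new_entry["weight"]:
            continue  # keep the existing approximate entry
        index = initial_entity_library.index(approximate)
        initial_entity_library[index] = new_entry  # replace with the new entry
    return initial_entity_library
```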
Compared with the background art: the embodiment of the invention obtains a domain text by performing document domain classification on the original document; extracts text sentences from the domain text according to the original entity entries and original relation entries in the original entity library and the original relation library, so as to extract a target sentence set from the text sentences; extracts a candidate triple of each sentence in the target sentence set according to the verb of the sentence; judges, through the relation entry and the entity entry in the candidate triple, whether the candidate triple can be included in the original relation library and the original entity library, and obtains the initial relation library and the initial entity library when it can be included; and optimizes the initial relation library and the initial entity library according to the entity weight values to obtain a target relation library and a target entity library. Therefore, the method, the apparatus, the electronic device and the computer-readable storage medium for optimizing a knowledge graph based on document attribute assignment entity weight can solve the problems that the data of the knowledge graph is not updated in a timely manner and lags seriously.
Example 2:
fig. 4 is a functional block diagram of a knowledge-graph optimization apparatus for assigning entity weights based on document attributes according to an embodiment of the present invention.
The knowledge-graph optimization apparatus 100 for assigning entity weights based on document attributes according to the present invention can be installed in an electronic device. According to the implemented functions, the apparatus 100 for optimizing a knowledge graph based on document attribute assignment entity weight may include an original document classification module 101, a target sentence set extraction module 102, a candidate triple construction module 103, an initial relation library and initial entity library construction module 104, and an initial relation library and initial entity library optimization module 105. The modules of the present invention, which may also be referred to as units, refer to a series of computer program segments that can be executed by a processor of an electronic device, that can perform a fixed function, and that are stored in a memory of the electronic device.
The original document classification module 101 is configured to obtain an original document, and perform document domain classification on the original document according to word features in the original document to obtain a domain text;
The target sentence set extraction module 102 is configured to identify an original entity library and an original relation library corresponding to the domain text, and extract text sentences from the domain text according to the original entity entries and original relation entries in the original entity library and the original relation library; perform word segmentation processing on the text sentences to obtain a transaction entry set, and extract a target entry set from the transaction entry set according to the support degree and confidence degree of each entry in the transaction entry set; and extract sentences containing entries in the target entry set from the domain text to obtain a target sentence set;
The candidate triple construction module 103 is configured to extract a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
The initial relation library and initial entity library construction module 104 is configured to determine whether the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than a preset relation similarity threshold; if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than the relation similarity threshold, store the candidate triple into the original relation library and the original entity library to obtain an initial relation library and an initial entity library; if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is not greater than the relation similarity threshold, determine whether the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than a preset entity similarity threshold; if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than the entity similarity threshold, store the candidate triple into the original relation library and the original entity library to obtain the initial relation library and the initial entity library; and if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is not greater than the entity similarity threshold, return to the step of extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
The initial relation library and initial entity library optimization module 105 is configured to calculate the entity weight values of the entity entries, and optimize the initial relation library and the initial entity library according to the entity weight values to obtain a target relation library and a target entity library.
In detail, when the modules in the apparatus 100 for optimizing a knowledge graph based on document attribute assignment entity weight according to the embodiment of the present invention are used, the same technical means as the above method for optimizing a knowledge graph based on document attribute assignment entity weight shown in fig. 1 are adopted, and the same technical effects can be produced, which are not described herein again.
Example 3:
fig. 5 is a schematic structural diagram of an electronic device for implementing a method for optimizing a knowledge graph based on document attribute assignment entity weights according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a bus 12 and a communication interface 13, and may further comprise a computer program stored in the memory 11 and executable on the processor 10, such as a knowledge-graph optimization program for assigning entity weights based on document attributes.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the knowledge-graph optimization program assigning entity weights based on document attributes, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device; it connects the various components of the whole electronic device by using various interfaces and lines, runs or executes the programs or modules stored in the memory 11 (for example, the knowledge graph optimization program for assigning entity weights based on document attributes), and calls the data stored in the memory 11 to perform various functions of the electronic device 1 and process data.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and the at least one processor 10, and the like.
Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The knowledge-graph optimization program based on document attribute assignment entity weight stored in the memory 11 of the electronic device 1 is a combination of a plurality of instructions which, when executed by the processor 10, can realize:
obtaining an original document, and carrying out document field classification on the original document according to word characteristics in the original document to obtain a field text;
identifying an original entity library and an original relation library corresponding to the field text, and extracting text sentences from the field text according to original entity entries and original relation entries in the original entity library and the original relation library;
performing word segmentation processing on the text sentences to obtain a transaction entry set, and extracting a target entry set from the transaction entry set according to the support degree and confidence degree of each entry in the transaction entry set;
extracting sentences containing entries in the target entry set from the field text to obtain a target sentence set;
extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
judging whether the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than a preset relation similarity threshold;
if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than the relation similarity threshold, storing the candidate triple into the original relation library and the original entity library to obtain an initial relation library and an initial entity library;
if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is not greater than the relation similarity threshold, judging whether the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than a preset entity similarity threshold;
if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than the entity similarity threshold, storing the candidate triple into the original relation library and the original entity library to obtain the initial relation library and the initial entity library;
if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is not greater than the entity similarity threshold, returning to the step of extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
and calculating entity weight values of the entity entries, and optimizing the initial relation library and the initial entity library according to the entity weight values to obtain a target relation library and a target entity library.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 to fig. 4, which is not repeated herein.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor of an electronic device, implements:
acquiring an original document, and performing document field classification on the original document according to word characteristics in the original document to obtain a field text;
identifying an original entity library and an original relation library corresponding to the field text, and extracting text sentences from the field text according to original entity entries and original relation entries in the original entity library and the original relation library;
performing word segmentation processing on the text sentences to obtain a transaction entry set, and extracting a target entry set from the transaction entry set according to the support degree and confidence degree of each entry in the transaction entry set;
extracting sentences containing entries in the target entry set from the field text to obtain a target sentence set;
extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
judging whether the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than a preset relation similarity threshold;
if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is greater than the relation similarity threshold, storing the candidate triple into the original relation library and the original entity library to obtain an initial relation library and an initial entity library;
if the similarity between the relation entry in the candidate triple and any original relation entry in the original relation library is not greater than the relation similarity threshold, judging whether the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than a preset entity similarity threshold;
if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is greater than the entity similarity threshold, storing the candidate triple into the original relation library and the original entity library to obtain the initial relation library and the initial entity library;
if the similarity between the entity entry in the candidate triple and any original entity entry in the original entity library is not greater than the entity similarity threshold, returning to the step of extracting a candidate triple of each sentence in the target sentence set according to the verb of the sentence in the target sentence set;
and calculating entity weight values of the entity entries, and optimizing the initial relation library and the initial entity library according to the entity weight values to obtain a target relation library and a target entity library.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.