Disclosure of Invention
The invention aims to construct a unique ID for each event after event disambiguation, and provides a method for constructing event unique IDs based on event disambiguation.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
the method for constructing the event unique ID based on the event disambiguation comprises the following steps:
collecting a plurality of text data, and carrying out preliminary disambiguation on the collected text data;
performing text body analysis on each of the plurality of text data subjected to preliminary disambiguation, and outputting the respective text basic data;
and performing deep disambiguation on each output text basic data respectively, and outputting text cache data with a unique event, wherein the text cache data comprises a unique index ID and an event ID corresponding to the index ID.
The step of collecting a plurality of text data and preliminarily disambiguating the collected text data comprises the following steps:
extracting the text title, the text source website name and the text release date from each text data; if several text data have the same text title, text source website name and text release date, the duplicate text data are removed and only one is retained;
the text title, the text source website name and the text release date of the reserved text data are encrypted by using an MD5 encryption algorithm to generate a related information unique ID.
The step of respectively carrying out text body analysis on the plurality of text data after the preliminary disambiguation and outputting text basic data comprises the following steps of:
analyzing the body of each text data by using an NLP (natural language processing) method, and extracting the event type, event subject, start time, end time and event object from the body;
linking the extracted event subject and event object to their standard names by entity linking, so that the event subject and the event object have standard names, then matching a subject ID to the event subject and an object ID to the event object;
and establishing an index ID according to the extracted event type, the event subject, the start time, the end time and the event object, and finishing text basic data output.
The step of establishing an index ID according to the extracted event type, event subject, start time, end time and event object includes:
if only one event type, event subject, start time, end time and event object are extracted, establishing a single index ID corresponding to that event type, event subject, start time, end time and event object;
if more than one event type, event subject, start time, end time or event object is extracted, establishing a separate index ID for each combination of event type, event subject, start time, end time and event object, and appending a distinguishing suffix to each index ID.
The step of performing deep disambiguation on each output text basic data and outputting text cache data, wherein the text cache data comprises an event ID and an index ID, comprises the following steps:
if the text basic data only contains one index ID, generating an event ID for the index ID by using an MD5 encryption algorithm, and outputting text cache data, wherein the text cache data comprises a group of event ID-index ID;
if the text basic data contains a plurality of index IDs, judging whether the event types, event subjects, start times, end times and event objects corresponding to the index IDs are the same; if so, the duplicate index IDs are removed, an event ID is generated for each retained index ID by using the MD5 encryption algorithm, and text cache data is output, wherein the text cache data comprises one or more groups of event ID-index ID.
Compared with the prior art, the invention has the beneficial effects that:
By disambiguating text data to form a system of unique event IDs, the invention helps to control data quality, data standards, data sources and the like during data acquisition, data fusion, data analysis and other steps in the construction of an enterprise data warehouse.
In scenarios such as multi-source heterogeneous data fusion, a standard unified data ID system and unstructured data processing, the corresponding IDs are matched during disambiguation, which effectively guarantees the uniqueness and traceability of the data and provides a standard basic framework for subsequent data analysis and data mining.
Embodiment:
The invention is realized by the following technical scheme. As shown in fig. 1 and fig. 2, the invention provides an event unique ID construction method based on event disambiguation, which comprises the following steps:
Step S100: collecting a plurality of text data, and performing preliminary disambiguation on the collected text data.
The text data may be collected by a crawler or by other means. The collected text data may be of various kinds, such as news public opinion, basic business information, judicial litigation and administrative penalties; this embodiment takes news public opinion as the example text data. For example, if 10 news public opinion items are collected, preliminary disambiguation is performed on those 10 items.
Several fields, such as the text title, text source website name, text release date, version number, keyword and body, can be extracted from each news public opinion item. The 3 fields of text title, text source website name and text release date are extracted separately for each item and compared across the 10 items. If all 3 fields of two or more items are identical, those items are the same news public opinion; the duplicates are removed and only one is kept, which completes the preliminary disambiguation of the 10 collected items.
For example, the 3 fields of the text title, the name of the text source website and the text release date of 4 news opinions are shown in table 1:
| Chinese field | English field | News public opinion 1 | News public opinion 2 | News public opinion 3 | News public opinion 4 |
| Text title | news_title | A and B cooperate | A and B cooperate | Marriage of C and D | E acquires F |
| Text source website name | news_site | Data view | Data view | Self-service net | Data view |
| Text release date | pubdate | 2020-01-01 | 2020-01-01 | 2020-01-01 | 2020-01-02 |
TABLE 1
As can be seen from table 1, the 3 fields of the text titles, the names of the websites from which the texts are sourced, and the dates of text release of the news opinions 1 and 2 are all identical, so that it is indicated that the news opinions 1 and 2 are identical, and duplicate news opinions are removed, and only one of the news opinions is reserved.
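The comparison of the 3 fields can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `preliminary_disambiguation` and the choice to keep the first occurrence of each duplicate are assumptions, and the sample rows restate table 1.

```python
from typing import Dict, List

def preliminary_disambiguation(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one item per (news_title, news_site, pubdate) triple.

    Assumption: the first occurrence is the one that is retained.
    """
    seen = set()
    kept = []
    for item in items:
        key = (item["news_title"], item["news_site"], item["pubdate"])
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept

# The four news public opinion items of table 1.
table_1 = [
    {"news_title": "A and B cooperate", "news_site": "Data view", "pubdate": "2020-01-01"},
    {"news_title": "A and B cooperate", "news_site": "Data view", "pubdate": "2020-01-01"},
    {"news_title": "Marriage of C and D", "news_site": "Self-service net", "pubdate": "2020-01-01"},
    {"news_title": "E acquires F", "news_site": "Data view", "pubdate": "2020-01-02"},
]

kept = preliminary_disambiguation(table_1)
print(len(kept))  # 3: items 1 and 2 collapse into one
```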
After the preliminary disambiguation by comparing the 3 fields, the retained text title, text source website name and text release date of each news public opinion item are encrypted with the MD5 encryption algorithm to generate the related information unique ID. For example, after the related information unique ID is generated for news public opinion 1, it can be combined with the other extractable fields mentioned above to form the field description shown in table 2:
| English field | Chinese field |
| bbd_xgxx_id | Related information unique ID |
| bbd_table | Table name |
| bbd_type | Table type |
| uptime | Timestamp |
| do_time | Crawl date |
| version | Version number |
| bbd_seed | Keyword |
| bbd_url | Crawl URL |
| news_title | Text title |
| pubdate | Text release date |
| news_site | Text source website name |
| main | Body text |
TABLE 2
For example, after the actual field values of news public opinion 1 are filled into table 2, the field description shown in table 3 is formed:
| English field | Value |
| bbd_xgxx_id | f28a1d555831272ad0a2b7b0922ca564 |
| bbd_table | qyxg_yuqing |
| bbd_type | sic_chinacoal |
| uptime | 1557924294 |
| do_time | 2019-05-15 |
| version | 1 |
| bbd_seed | Shulian Mingpin cooperation |
| bbd_url | http://www.cbdio.com/BigData/2018-12/26/content_5966169.htm |
| news_title | Shulian Mingpin enters into strategic cooperation with Yiborui, a leading information services company |
| pubdate | 2018-12-26 10:24:23 |
| news_site | Data view |
| main | On the morning of December 25, Chengdu Shulian Mingpin Technology Co., Ltd. (BBD for short) and Yiborui Information Technology (Beijing) Co., Ltd. signed a strategic cooperation agreement in Shanghai. Based on their respective technology and market resource advantages, the two parties will jointly explore practical applications of big data in inclusive finance, retail risk management, credit service innovation and other fields. On August 29, 2018, China Shipbuilding Heavy Industry Group Co., Ltd., the People's Government of Heilongjiang Province and the People's Government of Harbin City signed an agreement on deepening strategic cooperation in Harbin. |
TABLE 3
It should be noted that at least the 3 fields of text title, text source website name and text release date are extracted from each text data; other fields may be defined and extracted according to the actual situation, and the English field names corresponding to the Chinese fields may likewise be chosen according to the actual situation. Tables 1, 2 and 3 are only examples for ease of understanding.
After the preliminary disambiguation and the field encryption, the text data of each news public opinion can be output for further analysis.
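As a minimal sketch of the field encryption step, the related information unique ID could be derived with Python's standard `hashlib` module. The text does not specify how the three fields are concatenated before MD5 is applied, so the `|` separator and UTF-8 encoding below are assumptions, as is the function name `related_info_id`; the sketch is therefore not expected to reproduce the `bbd_xgxx_id` value of table 3. Strictly speaking, MD5 is a hash function that yields a 128-bit digest, which is what the text refers to as encryption.

```python
import hashlib

def related_info_id(news_title: str, news_site: str, pubdate: str) -> str:
    """Return a 32-character hexadecimal MD5 digest of the three fields.

    Assumption: fields are joined with "|" and encoded as UTF-8;
    the source does not state the exact concatenation rule.
    """
    raw = "|".join([news_title, news_site, pubdate])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

id_1 = related_info_id("A and B cooperate", "Data view", "2020-01-01")
id_3 = related_info_id("Marriage of C and D", "Self-service net", "2020-01-01")
print(len(id_1), id_1 != id_3)
```

Because the digest depends only on the three fields, two items that survive as duplicates in different crawls would still receive the same ID.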
Step S200: performing text body analysis on each of the plurality of text data after the preliminary disambiguation, and outputting the respective text basic data.
The body of each news public opinion item is analyzed by using an NLP (natural language processing) method, and 5 fields are extracted from it: event type, event subject, start time, end time and event object. The event type, event subject and start time must be present, whereas the end time and event object may be empty. For example, a news public opinion item may contain only one subject (event subject), what the subject does (event type) and a start time, with no counterpart (event object) and no end time.
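The distinction between mandatory and optional fields can be illustrated with a small record type. This is a hypothetical sketch: the class name `ExtractedEvent` and the validation in `__post_init__` are assumptions, and the NLP extraction itself is outside its scope.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ExtractedEvent:
    """One event extracted from a text body (illustrative structure only)."""
    event_type: str                     # mandatory
    event_subject: str                  # mandatory
    start_time: str                     # mandatory
    end_time: Optional[str] = None      # may be empty
    event_object: Optional[str] = None  # may be empty

    def __post_init__(self) -> None:
        # Reject records that lack any of the three mandatory fields.
        for name in ("event_type", "event_subject", "start_time"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required")

# An event with a subject, a type and a start time, but no object or end time.
event = ExtractedEvent("Enterprise cooperation", "BBD", "2018-12-26")
print(event.event_object is None, event.end_time is None)
```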
If an event object exists in the news public opinion item, the extracted event subject and event object are linked to their standard names by entity linking, so that both have standard names. For example, if the event subject extracted from the item is "Shulian Mingpin" or "BBD", it needs to be linked to the standard name, that is, the full name "Chengdu Shulian Mingpin Technology Co., Ltd.". After the event subject and event object have been linked to their standard names, a subject ID is matched to the event subject and an object ID is matched to the event object. If the extracted event object is an empty field, no standard name linking is needed for it; likewise, if the extracted event subject and event object are already standard names, no standard name linking is needed.
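A dictionary lookup gives a rough picture of the linking and ID matching. Real entity linking is considerably more involved; the alias table `ALIAS_TABLE`, the function `link_entity` and the English renderings of the company names are hypothetical and only illustrate the data flow. The ID value is the subject ID used in this embodiment's tables. Returning no ID for an unknown mention is an assumption that mirrors the null object IDs in the tables.

```python
from typing import Dict, Optional, Tuple

STANDARD_NAME = "Chengdu Shulian Mingpin Technology Co., Ltd."  # illustrative rendering
BBD_ID = "17988de145c14f808fd2ffa0dc1399d7"

# Hypothetical alias table: mention -> (standard name, entity ID).
ALIAS_TABLE: Dict[str, Tuple[str, str]] = {
    "BBD": (STANDARD_NAME, BBD_ID),
    "Shulian Mingpin": (STANDARD_NAME, BBD_ID),
    STANDARD_NAME: (STANDARD_NAME, BBD_ID),
}

def link_entity(mention: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (standard name, ID) for a mention.

    An empty mention needs no linking; an unknown mention keeps its
    surface form and receives no ID (assumption).
    """
    if not mention:
        return None, None
    if mention in ALIAS_TABLE:
        return ALIAS_TABLE[mention]
    return mention, None

print(link_entity("BBD"))
print(link_entity("Harbin city government"))
print(link_entity(None))
```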
Since a news public opinion item may include several event subjects, several event objects or several event types, the text data after the preliminary disambiguation is analyzed in order to output the text basic data.
The number of event types, event subjects, start times, end times and event objects extracted from the item is determined. If only one of each is extracted, a single index ID corresponding to that event type, event subject, start time, end time and event object is established directly. If the event object or end time field does not exist, the index ID is established for the fields that do exist. Assume that the text basic data formed after preliminary disambiguation has the fields shown in table 4:
| English field | Chinese field |
| bbd_xgxx_id | Related information unique ID |
| search_id | Index ID |
| event_type | Event type |
| event_subject | Event subject |
| subject_id | Subject ID |
| pubdate | Release date |
| start_time | Start time |
| end_time | End time |
| event_object | Event object |
| object_id | Object ID |
TABLE 4
If more than one event type, event subject, start time, end time or event object is extracted from the text data, take for example the body of a news public opinion item after preliminary disambiguation, shown in table 5:
| main | On the morning of December 25, Chengdu Shulian Mingpin Technology Co., Ltd. and Yiborui Information Technology (Beijing) Co., Ltd. signed a strategic cooperation agreement in Shanghai. Based on their respective technology and market resource advantages, the two parties will jointly explore practical applications of big data in inclusive finance, retail risk management, credit service innovation and other fields. On August 29, 2018, China Shipbuilding Heavy Industry Group Co., Ltd., the People's Government of Heilongjiang Province and the People's Government of Harbin City signed an agreement on deepening strategic cooperation in Harbin. |
TABLE 5
From the body of this news public opinion item, the event type "enterprise cooperation" can be extracted; the event subjects include "Chengdu Shulian Mingpin Technology Co., Ltd." (hereinafter "Shulian Mingpin") and "China Shipbuilding Heavy Industry Group Co., Ltd." (hereinafter "China Shipbuilding"); and the event objects include "Yiborui Information Technology (Beijing) Co., Ltd." (hereinafter "Yiborui"), the "People's Government of Heilongjiang Province" and the "People's Government of Harbin City".
The event types, event subjects, start times, end times and event objects extracted in this way form the text basic data shown in table 6:
| English field | Value 1 | Value 2 | Value 3 |
| bbd_xgxx_id | f28a1d555831272ad0a2b7b0922ca564 | f28a1d555831272ad0a2b7b0922ca564 | f28a1d555831272ad0a2b7b0922ca564 |
| search_id | f28a1d555831272ad0a2b7b0922ca564_1 | f28a1d555831272ad0a2b7b0922ca564_2_1 | f28a1d555831272ad0a2b7b0922ca564_2_2 |
| event_type | Enterprise cooperation | Enterprise cooperation | Enterprise cooperation |
| event_subject | Shulian Mingpin | China Shipbuilding Heavy Industry Group Co., Ltd. | China Shipbuilding Heavy Industry Group Co., Ltd. |
| subject_id | 17988de145c14f808fd2ffa0dc1399d7 | 89988de145c14f808fd2ffa0dc1399d7 | 89988de145c14f808fd2ffa0dc1399d7 |
| pubdate | 2015-06-15 12:15:00 | 2018-09-03 00:00:00 | 2018-09-03 00:00:00 |
| start_time | 2018-12-26 | 2018-08-29 | 2018-08-29 |
| end_time | | | |
| event_object | Yiborui | Government of Heilongjiang province | Harbin city government |
| object_id | 88988de145c14f808fd2ffa0dc1399d7 | null | null |
TABLE 6
As can be seen from table 6, the body of the text data contains three events of the type "enterprise cooperation", each with its own event subject, event object and start time, and with no end time. An index ID is established for each of the three combinations of "event type, event subject, start time, event object", and a distinguishing suffix is appended to each index ID.
For example, the three events involve two event subjects, "Shulian Mingpin" and "China Shipbuilding", so a first-level distinguishing suffix _1 is appended to the index ID corresponding to "Shulian Mingpin" and a first-level distinguishing suffix _2 is appended to the index ID corresponding to "China Shipbuilding". The event subject "China Shipbuilding" has two event objects, the "People's Government of Heilongjiang Province" and the "People's Government of Harbin City", so a second-level distinguishing suffix _1 is appended to the index ID corresponding to the "People's Government of Heilongjiang Province" to form _2_1, and likewise a second-level distinguishing suffix _2 is appended to the index ID corresponding to the "People's Government of Harbin City" to form _2_2, thereby distinguishing the index IDs.
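The suffix scheme can be sketched as follows. The sketch rests on several assumptions not stated in the text: the index ID is the related information unique ID plus the suffix, a single event receives no suffix, subjects are numbered in order of first appearance, and a second-level suffix is added only when a subject has more than one event. The function name `build_index_ids` is hypothetical.

```python
from typing import Dict, List, Tuple

def build_index_ids(info_id: str, events: List[Tuple[str, str]]) -> List[str]:
    """Build one index ID per (event_subject, event_object) pair.

    Assumptions: a lone event keeps the bare info_id; otherwise the first
    level numbers the subjects and the second level numbers the events of
    a subject that has more than one event.
    """
    if len(events) == 1:
        return [info_id]
    positions_by_subject: Dict[str, List[int]] = {}
    for pos, (subject, _obj) in enumerate(events):
        positions_by_subject.setdefault(subject, []).append(pos)
    index_ids = [""] * len(events)
    for s_no, positions in enumerate(positions_by_subject.values(), start=1):
        for o_no, pos in enumerate(positions, start=1):
            suffix = f"_{s_no}" if len(positions) == 1 else f"_{s_no}_{o_no}"
            index_ids[pos] = info_id + suffix
    return index_ids

INFO_ID = "f28a1d555831272ad0a2b7b0922ca564"
table_6_events = [
    ("Shulian Mingpin", "Yiborui"),
    ("China Shipbuilding", "Government of Heilongjiang province"),
    ("China Shipbuilding", "Harbin city government"),
]
for index_id in build_index_ids(INFO_ID, table_6_events):
    print(index_id)  # suffixes _1, _2_1 and _2_2, as in table 6
```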
Step S300: performing deep disambiguation on each output text basic data, and outputting text cache data, wherein the text cache data comprises a unique index ID and an event ID corresponding to the index ID.
If the text basic data contains only one index ID, an event ID is generated for that index ID by using the MD5 encryption algorithm, and text cache data is output; the text cache data comprises one unique index ID and its corresponding event ID, that is, one group of "event ID-index ID". If the event object or end time field corresponding to the index ID is empty, the empty event object is still matched with an empty object ID, so that at least the four fields of event type, subject ID, object ID and start time are encrypted together with MD5; the event ID must not be generated by encrypting the fields with the object ID field left out.
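One possible reading of this rule is sketched below: the object ID always occupies its own slot in the MD5 input, even when it is empty. The `|` separator, the UTF-8 encoding, the inclusion of the end time as a fifth slot and the function name `make_event_id` are all assumptions; the text only requires that at least the four named fields take part.

```python
import hashlib
from typing import Optional

def make_event_id(event_type: str, subject_id: str, object_id: Optional[str],
                  start_time: str, end_time: Optional[str] = None) -> str:
    """MD5 digest over fixed field slots.

    An empty object ID (or end time) is kept as an empty slot rather than
    dropped, so the number of fields in the input never changes (assumption
    about how the 'empty object ID' is represented).
    """
    slots = [event_type, subject_id, object_id or "", start_time, end_time or ""]
    return hashlib.md5("|".join(slots).encode("utf-8")).hexdigest()

with_object = make_event_id("Enterprise cooperation",
                            "17988de145c14f808fd2ffa0dc1399d7",
                            "88988de145c14f808fd2ffa0dc1399d7",
                            "2018-12-26")
without_object = make_event_id("Enterprise cooperation",
                               "89988de145c14f808fd2ffa0dc1399d7",
                               None,
                               "2018-08-29")
print(len(with_object), len(without_object))
```

Keeping the empty slot means that a record without an object ID and a record whose fields happen to shift by one position cannot produce the same input string.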
If the text basic data contains several index IDs, it is determined whether the "event type-event subject-start time-end time-event object" combinations corresponding to those index IDs are the same. If some are the same, the duplicate index IDs are removed and an event ID is generated for each retained index ID by using the MD5 encryption algorithm. This completes the deep disambiguation of the text basic data and forms text cache data in which each event is unique; the text cache data comprises one or more unique index IDs and their corresponding event IDs, that is, one or more groups of "event ID-index ID".
For example, the "event type-event subject-start time-end time-event object" corresponding to three groups of index IDs extracted from the text base data of a news opinion is shown in table 7:
| English field | Value 1 | Value 2 | Value 3 |
| bbd_xgxx_id | f28a1d555831272ad0a2b7b0922ca564 | f28a1d555831272ad0a2b7b0922ca564 | f28a1d555831272ad0a2b7b0922ca564 |
| search_id | f28a1d555831272ad0a2b7b0922ca564_1_1 | f28a1d555831272ad0a2b7b0922ca564_1_2 | f28a1d555831272ad0a2b7b0922ca564_2 |
| event_type | Enterprise cooperation | Enterprise cooperation | Enterprise cooperation |
| event_subject | Shulian Mingpin | Shulian Mingpin | China Shipbuilding Heavy Industry Group Co., Ltd. |
| subject_id | 17988de145c14f808fd2ffa0dc1399d7 | 17988de145c14f808fd2ffa0dc1399d7 | 89988de145c14f808fd2ffa0dc1399d7 |
| pubdate | 2015-06-15 12:15:00 | 2015-06-15 12:15:00 | 2018-09-03 00:00:00 |
| start_time | 2018-12-26 | 2018-12-26 | 2018-08-29 |
| end_time | 2018-12-28 | 2018-12-28 | |
| event_object | Yiborui | Yiborui | Harbin city government |
| object_id | 88988de145c14f808fd2ffa0dc1399d7 | 88988de145c14f808fd2ffa0dc1399d7 | null |
TABLE 7
As can be seen from table 7, the "event type-event subject-start time-end time-event object" combination of the first index ID is identical to that of the second index ID, which indicates that the two index IDs correspond to the same event. The duplicate is therefore removed and only one is kept, which completes the deep disambiguation of the text basic data and yields text basic data in which each event is unique, as shown in table 8:
| English field | Value 1 | Value 2 |
| bbd_xgxx_id | f28a1d555831272ad0a2b7b0922ca564 | f28a1d555831272ad0a2b7b0922ca564 |
| search_id | f28a1d555831272ad0a2b7b0922ca564_1_1 | f28a1d555831272ad0a2b7b0922ca564_2 |
| event_type | Enterprise cooperation | Enterprise cooperation |
| event_subject | Shulian Mingpin | China Shipbuilding Heavy Industry Group Co., Ltd. |
| subject_id | 17988de145c14f808fd2ffa0dc1399d7 | 89988de145c14f808fd2ffa0dc1399d7 |
| pubdate | 2015-06-15 12:15:00 | 2018-09-03 00:00:00 |
| start_time | 2018-12-26 | 2018-08-29 |
| end_time | 2018-12-28 | |
| event_object | Yiborui | Harbin city government |
| object_id | 88988de145c14f808fd2ffa0dc1399d7 | null |
TABLE 8
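The step from table 7 to table 8 can be sketched as a de-duplication keyed on the five event fields. The function name `deep_disambiguation`, the treatment of empty fields as empty strings and the choice to keep the first of several identical events are assumptions; the rows restate only the relevant columns of table 7.

```python
from typing import Dict, List, Optional

KEY_FIELDS = ("event_type", "event_subject", "start_time", "end_time", "event_object")

def deep_disambiguation(rows: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Keep the first row for each distinct five-field event combination."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row.get(field) or "" for field in KEY_FIELDS)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

INFO_ID = "f28a1d555831272ad0a2b7b0922ca564"
table_7 = [
    {"search_id": INFO_ID + "_1_1", "event_type": "Enterprise cooperation",
     "event_subject": "Shulian Mingpin", "start_time": "2018-12-26",
     "end_time": "2018-12-28", "event_object": "Yiborui"},
    {"search_id": INFO_ID + "_1_2", "event_type": "Enterprise cooperation",
     "event_subject": "Shulian Mingpin", "start_time": "2018-12-26",
     "end_time": "2018-12-28", "event_object": "Yiborui"},
    {"search_id": INFO_ID + "_2", "event_type": "Enterprise cooperation",
     "event_subject": "China Shipbuilding Heavy Industry Group Co., Ltd.",
     "start_time": "2018-08-29", "end_time": None,
     "event_object": "Harbin city government"},
]

table_8 = deep_disambiguation(table_7)
print([row["search_id"] for row in table_8])  # the _1_1 and _2 rows remain
```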
For each index ID retained in the text basic data after deep disambiguation, a unique event ID corresponding to that index ID is generated by using the MD5 encryption algorithm, which makes the data traceable; that is, the text cache data shown in table 9 is stored in a standard library for use by subsequent services:
| English field | Chinese field |
| event_id | Event ID |
| search_id | Index ID |
TABLE 9
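To illustrate how such "event ID-index ID" pairs support tracing, the sketch below follows an event ID back to its index ID and from there to the related information unique ID of the source text. It assumes that the index ID has the form "related information unique ID plus `_`-separated numeric suffixes", as in the example tables; the function name `trace_event` and the placeholder event ID strings are hypothetical.

```python
from typing import Dict, List, Tuple

def trace_event(event_id: str, cache: Dict[str, str]) -> Tuple[str, str, List[int]]:
    """Return (index ID, related information unique ID, suffix levels).

    Assumption: the related information unique ID is a hexadecimal string
    without underscores, so splitting on "_" separates it from the suffixes.
    """
    search_id = cache[event_id]
    info_id, *levels = search_id.split("_")
    return search_id, info_id, [int(level) for level in levels]

INFO_ID = "f28a1d555831272ad0a2b7b0922ca564"
# Text cache data as event_id -> search_id; the keys are placeholders,
# not real MD5 values.
text_cache = {
    "event-id-a": INFO_ID + "_1_1",
    "event-id-b": INFO_ID + "_2",
}

print(trace_event("event-id-a", text_cache))
```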
By performing the disambiguation processing and ID matching on every news public opinion item, text cache data in which each event is unique can finally be output for each text data.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.