Chinese field	English field	News public opinion 1	News public opinion 2	News public opinion 3	News public opinion 4
Text title	news_title	A and B cooperate	A and B cooperate	Marriage of C and D	E acquires F
Text source website name	news_site	Data view	Data view	Self-service net	Data view
Date of text release	pubdate	2020-01-01	2020-01-01	2020-01-01	2020-01-02

TABLE 1

As can be seen from Table 1, news public opinions 1 and 2 have identical values in all 3 fields (text title, text source website name, and date of text release), which indicates that they are the same news public opinion. The duplicate is therefore removed and only one of the two is retained.
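The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preliminary_disambiguation` and the choice to keep the first occurrence of each duplicate are hypothetical, since the text only says that one copy is retained. The sample records reuse the values of Table 1.

```python
def preliminary_disambiguation(records):
    """Keep one record per (title, site, release date) triple.

    Assumption: the first occurrence is the one that is kept.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec["news_title"], rec["news_site"], rec["pubdate"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# The four news public opinions of Table 1.
table_1 = [
    {"news_title": "A and B cooperate", "news_site": "Data view", "pubdate": "2020-01-01"},
    {"news_title": "A and B cooperate", "news_site": "Data view", "pubdate": "2020-01-01"},
    {"news_title": "Marriage of C and D", "news_site": "Self-service net", "pubdate": "2020-01-01"},
    {"news_title": "E acquisition F", "news_site": "Data view", "pubdate": "2020-01-02"},
]

kept = preliminary_disambiguation(table_1)
print(len(kept))  # 3
```

Using a tuple of the 3 fields as the key means two records count as duplicates only when all 3 fields match, which is the condition stated for Table 1.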

After this preliminary disambiguation by comparing the 3 fields, the retained text title, text source website name and text release date of each news public opinion are encrypted with the MD5 encryption algorithm to generate the unique ID of the related information. For example, once the related-information unique ID has been generated for news public opinion 1, it can be combined with the other extractable fields mentioned above to form the field description shown in Table 2:

English field	Chinese field
bbd_xgxx_id	Unique ID of related information
bbd_table	Table name
bbd_type	Table type
uptime	Timestamp
do_time	Crawl date
version	Version number
bbd_seed	Keyword
bbd_url	Crawl link
news_title	Text title
pubdate	Date of text release
news_site	Text source website name
main	Body text

TABLE 2

For example, after the actual values of news public opinion 1 are filled into the fields of Table 2, the field description shown in Table 3 is obtained:

English field	Value
bbd_xgxx_id	f28a1d555831272ad0a2b7b0922ca564
bbd_table	qyxg_yuqing
bbd_type	sic_chinacoal
uptime	1557924294
do_time	2019-05-15
version	1
bbd_seed	Data link nameplate cooperation
bbd_url	http://www.cbdio.com/BigData/2018-12/26/content_5966169.htm
news_title	Strategic cooperation of data link with Yiborui leading information service company
pubdate	2018-12-26 10:24:23
news_site	Data view
main	On the morning of December 25, Chengdu Digital Link Product Technology Co., Ltd. (BBD for short) and Yiborui Information Technology (Beijing) Co., Ltd. signed a strategic cooperation agreement in Shanghai. Based on their respective technical and market resource advantages, both parties will jointly explore the application of big data in general finance, retail risk management, credit service innovation and the like. On August 29, 2018, China Shipbuilding Heavy Industry Group Co., Ltd., the People's Government of Heilongjiang Province and the People's Government of Harbin signed a deepened strategic cooperation agreement.

TABLE 3

It should be noted that at least 3 fields (text title, text source website name and text release date) can be extracted from any piece of text data. Other fields may be defined and extracted according to the actual situation, and the English field names corresponding to the Chinese fields may likewise be chosen according to the actual situation. Tables 1, 2 and 3 are only examples given for ease of understanding.

After the preliminary disambiguation and the field encryption, the text data of each news public opinion can be output for further analysis.
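As a hedged illustration of how the related-information unique ID might be derived from the 3 retained fields, the sketch below uses Python's standard `hashlib` module. The function name `related_info_id`, the `|` separator and the UTF-8 encoding are assumptions; the text does not say how the fields are combined before MD5 is applied, so the digest produced here is not expected to match the value shown in Table 3. Strictly speaking MD5 is a hash function rather than an encryption method, which is what makes the resulting ID fixed-length and repeatable.

```python
import hashlib


def related_info_id(news_title, news_site, pubdate, sep="|"):
    """Return a 32-character MD5 hex digest over the 3 retained fields.

    Assumption: the fields are joined with `sep` and encoded as UTF-8.
    """
    raw = sep.join([news_title, news_site, pubdate])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


uid = related_info_id("A and B cooperate", "Data view", "2020-01-01")
print(uid)
```

Because the digest depends only on the 3 fields, two texts that survive preliminary disambiguation with different titles, sites or dates receive different IDs, while re-processing the same text reproduces the same ID.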

Step S200: perform body-text analysis on each of the text data after preliminary disambiguation, and output the respective text basic data.

The body text of each news public opinion is analyzed with a natural language processing (NLP) method, and 5 fields are extracted from it: event type, event subject, start time, end time and event object. The event type, event subject and start time are mandatory fields, while the end time and event object may be empty. For example, a news public opinion may contain only a subject (event subject), what that subject does (event type) and a start time, with no counterpart (event object) and no end time.
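The 5 extracted fields and their mandatory or optional status can be captured in a small record type. The following is a minimal sketch, not the extraction method itself: the class name `ExtractedEvent` is hypothetical, and the NLP step that would populate it is outside the scope of this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedEvent:
    # Mandatory fields.
    event_type: str
    event_subject: str
    start_time: str
    # Optional fields; None stands for an empty field.
    end_time: Optional[str] = None
    event_object: Optional[str] = None

    def __post_init__(self):
        # Reject records in which a mandatory field is missing or empty.
        for name in ("event_type", "event_subject", "start_time"):
            if not getattr(self, name):
                raise ValueError(f"{name} is mandatory")


# An event with a subject, a type and a start time, but no object or end time.
event = ExtractedEvent("Enterprise collaboration", "BBD", "2018-12-26")
print(event.event_object)  # None
```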

If an event object exists in the news public opinion, the extracted event subject and event object are linked to their standard names by means of entity linking, so that both carry standard names. For example, if the event subject extracted from the news public opinion is "digital link product" or "BBD", it needs to be linked to the standard name "digital link product technology limited", that is, the full name. After the event subject and event object have been linked to their standard names, a subject ID is matched to the event subject and an object ID is matched to the event object. If the extracted event object is an empty field, no standard-name linking is needed for the event object; likewise, if the extracted event subject and event object are already standard names, no standard-name linking is needed.
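The linking and ID matching described above can be illustrated with a simple lookup. This is a sketch under strong assumptions: the text does not specify the entity-linking technique, so a plain alias dictionary stands in for it here, and `ALIASES`, `ENTITY_IDS` and `link_entity` are hypothetical names. The ID value is borrowed from the subject_id example used in the tables of this description.

```python
# Hypothetical alias table: extracted mention -> standard (full) name.
ALIASES = {
    "digital link product": "digital link product technology limited",
    "BBD": "digital link product technology limited",
}

# Hypothetical table: standard name -> entity ID.
ENTITY_IDS = {
    "digital link product technology limited": "17988de145c14f808fd2ffa0dc1399d7",
}


def link_entity(mention):
    """Return (standard_name, entity_id) for an extracted mention.

    An empty mention is not linked. A mention with no alias entry is
    treated as already being a standard name. An entity with no known
    ID yields None as its ID.
    """
    if not mention:
        return None, None
    standard = ALIASES.get(mention, mention)
    return standard, ENTITY_IDS.get(standard)


print(link_entity("BBD"))
print(link_entity(None))
```

The same function would serve for event subjects and event objects; only the name of the resulting ID field (subject ID or object ID) differs.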

Since a news public opinion may contain several event subjects, several event objects or several event types, the text data after preliminary disambiguation is analyzed and text basic data is output as follows.

The number of event types, event subjects, start times, end times and event objects extracted from the news public opinion is determined. If only one of each is extracted, an index ID corresponding to that event type, event subject, start time, end time and event object is established directly. If the event object or end time field does not exist, the index ID is established for the fields that do exist. Assume that the text basic data shown in Table 4 is formed after preliminary disambiguation:

English field	Chinese field
bbd_xgxx_id	Unique ID of related information
search_id	Index ID
event_type	Event type
event_subject	Event subject
subject_id	Subject ID
pubdate	Date of release
start_time	Start time
end_time	End time
event_object	Event object
object_id	Object ID

TABLE 4

If more than one event type, event subject, start time, end time or event object is extracted from the text data, the case is handled as follows. For example, suppose the body text of a news public opinion after preliminary disambiguation is as shown in Table 5:

main

On the morning of December 25, Chengdu Digital Link Product Technology Co., Ltd. and Yiborui Information Technology Co., Ltd. signed a strategic cooperation agreement in Shanghai. Based on their respective technology and market resource advantages, both parties will jointly explore the practical application of big data in general finance, retail risk management, credit service innovation and other fields. On August 29, 2018, China Shipbuilding Heavy Industry Group Co., Ltd., the People's Government of Heilongjiang Province and the People's Government of Harbin signed a deepened strategic cooperation agreement in Harbin.

TABLE 5

The event type that can be extracted from the body text of this news public opinion is "enterprise cooperation". The event subjects are Chengdu Digital Link Product Technology Co., Ltd. (hereinafter "BBD") and China Shipbuilding Heavy Industry Group Co., Ltd., and the event objects are Yiborui Information Technology Co., Ltd. (hereinafter "Yiborui"), the People's Government of Heilongjiang Province and the People's Government of Harbin.

The event type, event subject, start time, end time, event object extracted in this way can form the text base data shown in table 6:

English field	Value 1	Value 2	Value 3
bbd_xgxx_id	f28a1d555831272ad0a2b7b0922ca564	f28a1d555831272ad0a2b7b0922ca564	f28a1d555831272ad0a2b7b0922ca564
search_id	f28a1d555831272ad0a2b7b0922ca564_1	f28a1d555831272ad0a2b7b0922ca564_2_1	f28a1d555831272ad0a2b7b0922ca564_2_2
event_type	Enterprise collaboration	Enterprise collaboration	Enterprise collaboration
event_subject	Number linked name plate	China Shipbuilding Heavy Industry Group Co.,Ltd.	China Shipbuilding Heavy Industry Group Co.,Ltd.
subject_id	17988de145c14f808fd2ffa0dc1399d7	89988de145c14f808fd2ffa0dc1399d7	89988de145c14f808fd2ffa0dc1399d7
pubdate	2015-06-15 12:15:00	2018-09-03 00:00:00	2018-09-03 00:00:00
start_time	2018-12-26	2018-08-29	2018-08-29
end_time			
event_object	Yiborui	Government of Heilongjiang province	Harbin city government
object_id	88988de145c14f808fd2ffa0dc1399d7	null	null

TABLE 6

As can be seen from Table 6, the body of the text data contains three events of the same event type, each with its own event subject, event object and start time, and none with an end time. An index ID is established for each of the three "event type, event subject, start time, event object" groups, and a distinguishing suffix is appended to each index ID.

For example, the three events involve two event subjects, BBD (shown as "Number linked name plate" in Table 6) and China Shipbuilding Heavy Industry Group Co., Ltd. A first-level distinguishing suffix _1 is appended to the index ID corresponding to BBD, and a first-level distinguishing suffix _2 is appended to the index ID corresponding to China Shipbuilding Heavy Industry Group Co., Ltd. The latter has two event objects, the Heilongjiang provincial government and the Harbin city government. A second-level distinguishing suffix _1 is therefore appended to the index ID corresponding to the Heilongjiang provincial government to form _2_1, and likewise a second-level distinguishing suffix _2 is appended to the index ID corresponding to the Harbin city government to form _2_2, so that the index IDs are distinguished from one another.
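The suffix scheme can be sketched as follows. This is an illustrative reading of the example rather than a definitive implementation: the function name `build_index_ids` is hypothetical, events are reduced to (subject, object) pairs for brevity, and subjects are numbered in order of first appearance. How a text containing a single event is suffixed is not stated in the description; this sketch would give it `_1`.

```python
def build_index_ids(info_id, events):
    """Assign a suffixed index ID to each (subject, object) pair.

    First-level suffix: position of the subject (by first appearance).
    Second-level suffix: position of the object under that subject,
    added only when the subject has more than one event.
    """
    groups = {}
    for subject, obj in events:
        groups.setdefault(subject, []).append(obj)

    index_ids = []
    for i, (subject, objects) in enumerate(groups.items(), start=1):
        if len(objects) == 1:
            index_ids.append((f"{info_id}_{i}", subject, objects[0]))
        else:
            for j, obj in enumerate(objects, start=1):
                index_ids.append((f"{info_id}_{i}_{j}", subject, obj))
    return index_ids


info_id = "f28a1d555831272ad0a2b7b0922ca564"
events = [
    ("BBD", "Yiborui"),
    ("China Shipbuilding", "Government of Heilongjiang province"),
    ("China Shipbuilding", "Harbin city government"),
]
for index_id, subject, obj in build_index_ids(info_id, events):
    print(index_id, subject, obj)
```

With the three events of Table 6 as input, the sketch yields the suffixes `_1`, `_2_1` and `_2_2` in that order.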

Step S300: perform deep disambiguation on each output text basic data and output text cache data, wherein the text cache data comprises a unique index ID and an event ID corresponding to the index ID.

If the text basic data contains only one index ID, the MD5 encryption algorithm is applied to that index ID to generate an event ID, and text cache data is output. The text cache data then comprises one unique index ID and the event ID corresponding to it, that is, one "event ID-index ID" group. If the event object field or the end time field corresponding to the index ID is empty, the empty event object must at least be matched with an empty object ID, so that the four fields of event type, subject ID, object ID and start time are still all present when MD5 is applied. The event ID must not be generated by encrypting directly with the object ID field left out.
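The paragraph above mentions two possible inputs to MD5: the index ID itself, and the four fields of event type, subject ID, object ID and start time. The sketch below shows both as separate hypothetical helpers (`event_id_from_index_id` and `event_id_from_fields`) without claiming which one the method intends. The `|` separator and UTF-8 encoding are assumptions. In the second helper an empty object ID keeps its slot as an empty string instead of being dropped, which is one way to honor the requirement that the object ID field always takes part.

```python
import hashlib


def event_id_from_index_id(index_id):
    """MD5 hex digest of the index ID."""
    return hashlib.md5(index_id.encode("utf-8")).hexdigest()


def event_id_from_fields(event_type, subject_id, object_id, start_time, sep="|"):
    """MD5 hex digest over the four fields.

    Assumption: an empty object ID is represented by an empty string,
    so the field count and field positions never change.
    """
    parts = [event_type, subject_id, object_id or "", start_time]
    return hashlib.md5(sep.join(parts).encode("utf-8")).hexdigest()


eid_a = event_id_from_index_id("f28a1d555831272ad0a2b7b0922ca564_1")
eid_b = event_id_from_fields(
    "Enterprise collaboration",
    "89988de145c14f808fd2ffa0dc1399d7",
    None,  # empty object ID, still occupying its slot
    "2018-08-29",
)
print(eid_a, eid_b)
```

Keeping the empty slot means that an event with no object can never hash to the same input string as a different event whose remaining fields happen to line up after a field is removed.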

If the text basic data contains a plurality of index IDs, it is judged whether the "event type-event subject-start time-end time-event object" groups corresponding to those index IDs are the same. If several index IDs correspond to the same "event type-event subject-start time-end time-event object" group, the duplicate index IDs are removed and only one is retained. The MD5 encryption algorithm is then applied to each retained index ID to generate an event ID. This completes the deep disambiguation of the text basic data and forms text cache data in which each event is unique. The text cache data comprises one or more unique index IDs and the event IDs corresponding to them, that is, one or more "index ID-event ID" groups.

For example, the "event type-event subject-start time-end time-event object" corresponding to three groups of index IDs extracted from the text base data of a news opinion is shown in table 7:

English field	Value 1	Value 2	Value 3
bbd_xgxx_id	f28a1d555831272ad0a2b7b0922ca564	f28a1d555831272ad0a2b7b0922ca564	f28a1d555831272ad0a2b7b0922ca564
search_id	f28a1d555831272ad0a2b7b0922ca564_1_1	f28a1d555831272ad0a2b7b0922ca564_1_2	f28a1d555831272ad0a2b7b0922ca564_2
event_type	Enterprise collaboration	Enterprise collaboration	Enterprise collaboration
event_subject	Number linked name plate	Number linked name plate	China Shipbuilding Heavy Industry Group Co.,Ltd.
subject_id	17988de145c14f808fd2ffa0dc1399d7	89988de145c14f808fd2ffa0dc1399d7	89988de145c14f808fd2ffa0dc1399d7
pubdate	2015-06-15 12:15:00	2015-06-15 12:15:00	2018-09-03 00:00:00
start_time	2018-12-26	2018-12-26	2018-08-29
end_time	2018-12-28	2018-12-28	
event_object	Yiborui	Yiborui	Harbin city government
object_id	88988de145c14f808fd2ffa0dc1399d7	88988de145c14f808fd2ffa0dc1399d7	null

TABLE 7

As can be seen from Table 7, the "event type-event subject-start time-end time-event object" group corresponding to the first index ID is exactly the same as the group corresponding to the second index ID. This indicates that the first and second index IDs refer to the same event, so the duplicate is removed and only one event is retained. This completes the deep disambiguation of the text basic data and forms text basic data in which each event is unique, as shown in Table 8:

English field	Value 1	Value 2
bbd_xgxx_id	f28a1d555831272ad0a2b7b0922ca564	f28a1d555831272ad0a2b7b0922ca564
search_id	f28a1d555831272ad0a2b7b0922ca564_1_1	f28a1d555831272ad0a2b7b0922ca564_2
event_type	Enterprise collaboration	Enterprise collaboration
event_subject	Number linked name plate	China Shipbuilding Heavy Industry Group Co.,Ltd.
subject_id	17988de145c14f808fd2ffa0dc1399d7	89988de145c14f808fd2ffa0dc1399d7
pubdate	2015-06-15 12:15:00	2018-09-03 00:00:00
start_time	2018-12-26	2018-08-29
end_time	2018-12-28	
event_object	Yiborui	Harbin city government
object_id	88988de145c14f808fd2ffa0dc1399d7	null

TABLE 8
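The step from Table 7 to Table 8 can be sketched as a second de-duplication pass, this time keyed on the five event fields instead of the three text fields. The function name `deep_disambiguation` and the choice to keep the first of several identical events are assumptions; the rows below carry only the columns of Table 7 that take part in the comparison, plus the index ID.

```python
def deep_disambiguation(rows):
    """Keep one row per (type, subject, start, end, object) group.

    Assumption: the first row of each group is the one retained.
    """
    seen = set()
    kept = []
    for row in rows:
        key = (
            row["event_type"],
            row["event_subject"],
            row["start_time"],
            row.get("end_time"),
            row.get("event_object"),
        )
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


info_id = "f28a1d555831272ad0a2b7b0922ca564"
table_7 = [
    {"search_id": info_id + "_1_1", "event_type": "Enterprise collaboration",
     "event_subject": "Number linked name plate", "start_time": "2018-12-26",
     "end_time": "2018-12-28", "event_object": "Yiborui"},
    {"search_id": info_id + "_1_2", "event_type": "Enterprise collaboration",
     "event_subject": "Number linked name plate", "start_time": "2018-12-26",
     "end_time": "2018-12-28", "event_object": "Yiborui"},
    {"search_id": info_id + "_2", "event_type": "Enterprise collaboration",
     "event_subject": "China Shipbuilding Heavy Industry Group Co.,Ltd.",
     "start_time": "2018-08-29", "end_time": None,
     "event_object": "Harbin city government"},
]

table_8 = deep_disambiguation(table_7)
print([row["search_id"] for row in table_8])
```

The retained index IDs are the ones ending in `_1_1` and `_2`, matching the two columns of Table 8.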

The MD5 encryption algorithm is applied to each index ID retained in the text basic data after deep disambiguation to generate the unique event ID corresponding to that index ID, which makes the data traceable. That is, the text cache data shown in Table 9 is stored in a standard library for use by subsequent services:

English field	Chinese field
event_id	Event ID
search_id	Index ID

TABLE 9

By performing the disambiguation processing and ID matching on each news public opinion, text cache data in which each event is unique can finally be output for each text data.

The above description covers only specific embodiments of the present invention, and the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims

1. A method for constructing an event unique ID based on event disambiguation, characterized in that the method comprises the following steps:

collecting a plurality of text data, and performing preliminary disambiguation on the collected text data, wherein the preliminary disambiguation operation comprises: extracting the text title, the text source website name and the text release date from each text data, and if there are text data with the same text title, text source website name and text release date, removing the identical text data and retaining only one;

performing deep disambiguation on each output text basic data respectively, and outputting text cache data in which each event is unique, wherein the text cache data comprises a unique index ID and an event ID corresponding to the index ID; the deep disambiguation operation comprises: if the text basic data contains only one index ID, retaining that index ID; if the text basic data contains a plurality of index IDs, judging whether the event types, event subjects, start times, end times and event objects corresponding to the index IDs are the same, and if they are the same, removing the duplicate index IDs.

2. The event disambiguation-based event unique ID constructing method according to claim 1, wherein: the preliminary disambiguation operation further comprises: the text title, the text source website name and the text release date of the reserved text data are encrypted by using an MD5 encryption algorithm to generate a related information unique ID.

3. The event disambiguation-based event unique ID constructing method according to claim 2, wherein: the step of respectively carrying out text body analysis on the plurality of text data after the preliminary disambiguation and outputting text basic data comprises the following steps of:

4. The event disambiguation-based event unique ID constructing method according to claim 3, wherein: the step of establishing an index ID according to the extracted event type, event subject, start time, end time and event object includes:

5. The event disambiguation based event unique ID building method according to any of claims 1-4, characterized by: the step of outputting the text cache data with the unique event, wherein the text cache data comprises a unique index ID and an event ID corresponding to the index ID, and the step comprises the following steps:

and generating an event ID by using an MD5 encryption algorithm on the reserved index ID, and outputting text cache data, wherein one or more groups of event ID-index IDs are included in the text cache data.