







Technical Field
The present application relates to the technical field of natural language processing, and in particular to a method, an apparatus, and an application for extracting key information from event statement text.
Background
Existing approaches to extracting key information from case files have two main shortcomings. The first is a disconnect between business needs and technology: extraction and retrieval of key case information is treated as a simple query over basic fields such as the parties, the time, and the regulations violated, while the extraction of unstructured case-process information is insufficiently modeled. As a result, the technology supports only preliminary screening, and front-line officers still have to identify and organize the key information themselves when investigating cases or studying precedents, which is time-consuming and labor-intensive. The second is a technical challenge: process-information extraction from case files mostly relies on manually defined linguistic rules, which adapt poorly to the semi-colloquial grammar that law enforcement officers use when describing cases in natural language. Regional language features such as dialects further blur syntactic and grammatical characteristics, which greatly reduces the usability of natural language processing tools.
Therefore, there is an urgent need for a method of extracting key information from law enforcement case files that converts unstructured natural-language process text describing case-file procedures into a structured set of key information.
Summary of the Invention
Embodiments of the present application provide a method, an apparatus, and an application for extracting key information from event statement text. To address the current inability to quickly extract key case information from historical cases, an automatic extraction method is used to convert unstructured process text into a structured set of key information, which front-line law enforcement officers can then use for case information retrieval and case study.
In a first aspect, an embodiment of the present application provides a method for extracting key information from event statement text, the method comprising: segmenting the event statement text into words and inputting the result into a trained classifier for recognition, to obtain action entities and object entities output by the trained classifier; creating entity pairs to be identified according to an association mapping between the object entities and the action entities; extracting feature vectors of the entity pairs to be identified and inputting the feature vectors into a trained relationship recognition model, to obtain target entity pairs output by the trained relationship recognition model; and, where there are multiple target entity pairs, sequentially mapping the target entity pairs according to the order in which each target entity pair appears in the event statement text, to obtain structured key information.
In some embodiments, the trained classifier is trained as follows: acquiring a first sample set and labeling it to obtain a first activity entity set and a first attribute entity set; acquiring a second sample set and automatically pre-labeling it based on the first activity entity set and the first attribute entity set, to obtain a second activity entity set and a second attribute entity set; and training a classifier with the first sample set and the second sample set as input and with the first activity entity set, the first attribute entity set, the second activity entity set, and the second attribute entity set as output, to obtain the trained classifier.
In some embodiments, "automatically pre-labeling the second sample set based on the activity entity set and the attribute entity set" comprises: acquiring each unlabeled process text in the second sample set, where an unlabeled process text is represented as the set of phrases obtained by segmenting unlabeled text; and traversing the phrases in each unlabeled process text and matching each phrase against the activity entity set and the attribute entity set: if the phrase matches the activity entity set, it is labeled as an activity phrase, and if the phrase matches the attribute entity set, it is labeled as an attribute phrase.
In some embodiments, the method further comprises: calculating the labeling density of the activity phrases and attribute phrases in each unlabeled process text, and discarding the activity phrases and attribute phrases of any unlabeled process text whose labeling density is below a density threshold, to obtain the second activity entity set and the second attribute entity set.
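The pre-labeling and density filtering described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the tag strings, and the definition of labeling density as "labeled phrases divided by all phrases in the text" are assumptions, since the text does not fix a formula.

```python
from typing import List, Optional, Set, Tuple

# A pre-labeled text: (phrase, tag) pairs; tag is None when nothing matched.
Labeled = List[Tuple[str, Optional[str]]]


def pre_label(words: List[str], activity_set: Set[str],
              attribute_set: Set[str]) -> Labeled:
    """Tag each phrase by exact lookup in the two entity sets."""
    labeled: Labeled = []
    for word in words:
        if word in activity_set:
            labeled.append((word, "activity"))    # hypothetical tag name
        elif word in attribute_set:
            labeled.append((word, "attribute"))   # hypothetical tag name
        else:
            labeled.append((word, None))
    return labeled


def label_density(labeled: Labeled) -> float:
    """Assumed definition: share of phrases that received a tag."""
    if not labeled:
        return 0.0
    hits = sum(1 for _, tag in labeled if tag is not None)
    return hits / len(labeled)


def filter_by_density(texts: List[List[str]], activity_set: Set[str],
                      attribute_set: Set[str], threshold: float) -> List[Labeled]:
    """Keep only pre-labeled texts whose density reaches the threshold."""
    kept = []
    for words in texts:
        labeled = pre_label(words, activity_set, attribute_set)
        if label_density(labeled) >= threshold:
            kept.append(labeled)
    return kept
```

Texts that fall below the threshold contribute nothing to the second entity sets, which limits the noise introduced by sparse dictionary matches.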
In some embodiments, "creating entity pairs according to the association mapping between the object entities and the action entities" comprises: acquiring a one-to-one association between each action entity and each object entity, to obtain at least one preliminary mapped entity pair; and discarding any preliminary mapped entity pair for which the distance between the action entity and the object entity in the event statement text does not meet a preset condition, to obtain the entity pairs to be identified.
In some embodiments, the preset condition is that the distance between the action entity and the object entity in the event statement text is greater than a distance threshold.
In some embodiments, "the distance between the action entity and the object entity in the event statement text" is obtained as follows: creating a chained action entity set according to a first order in which the action entities appear in the event statement text, and recording the action start position of each action entity in the chained action entity set; creating a chained object entity set according to a second order in which the object entities appear in the event statement text, and recording the object start position of each object entity in the chained object entity set; and calculating the difference between the action start position and the object start position and taking the difference as the distance between the action entity and the object entity in the event statement text.
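A minimal sketch of the distance computation and filtering described above. The entity representation, the function names, and the use of the absolute difference are assumptions made for illustration; the preset condition is passed in as a predicate so that the sketch does not commit to one particular threshold rule.

```python
from typing import Callable, List, Tuple

# (entity text, start position in the segmented event statement text)
Entity = Tuple[str, int]


def build_chain(entities: List[Entity]) -> List[Entity]:
    """Order entities by appearance in the text (the 'chained' entity set)."""
    return sorted(entities, key=lambda entity: entity[1])


def pairs_to_identify(actions: List[Entity], objects: List[Entity],
                      condition: Callable[[int], bool]) -> List[Tuple[str, str, int]]:
    """Pair every action with every object and keep pairs meeting the condition."""
    kept = []
    for action_text, action_start in build_chain(actions):
        for object_text, object_start in build_chain(objects):
            # Assumption: distance is the absolute difference of start positions.
            distance = abs(action_start - object_start)
            if condition(distance):
                kept.append((action_text, object_text, distance))
    return kept
```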
In some embodiments, "the order in which each target entity pair appears in the event statement text" includes chronological order and/or positional order.
In a second aspect, an embodiment of the present application provides an apparatus for extracting key information from event statement text, comprising: an entity recognition module, configured to segment the event statement text into words and input the result into a trained classifier for recognition, to obtain action entities and object entities output by the trained classifier; an entity association module, configured to create entity pairs to be identified according to an association mapping between the object entities and the action entities; a relationship recognition module, configured to extract feature vectors of the entity pairs to be identified and input the feature vectors into a trained relationship recognition model, to obtain target entity pairs output by the trained relationship recognition model; and a sequence mapping module, configured to, where there are multiple target entity pairs, sequentially map the target entity pairs according to the order in which each target entity pair appears in the event statement text, to obtain structured key information.
In a third aspect, an embodiment of the present application provides an electronic device comprising a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the method for extracting key information from event statement text according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer program product comprising software code portions which, when the computer program product is run on a computer, perform the method for extracting key information from event statement text according to any implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a readable storage medium storing a computer program, the computer program comprising program code for controlling a process to execute a procedure, the procedure comprising the method for extracting key information from event statement text according to any implementation of the first aspect.
The main contributions and innovations of the embodiments of the present application are as follows:
The embodiments first extract actions and objects from the event statement text, then obtain the association mapping between them, and finally analyze the sequential mapping of each "action-object" pair in the text, thereby extracting the key information in the event statement text.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below, so that other features, objects, and advantages of the present application become easier to understand.
Brief Description of the Drawings
The drawings described here provide a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:
Figure 1 is a flow chart of the main steps of the method for extracting key information from event statement text according to the first embodiment of the present application.
Figure 2 is a schematic diagram of unlabeled text or event statement text according to an embodiment of the present application.
Figure 3 is a flow chart of building the case-file information extraction framework.
Figure 4 is a flow chart of the activity/attribute entity recognition process.
Figure 5 is a flow chart of the activity/attribute relationship recognition process.
Figure 6 is a schematic diagram representing activity-attribute relationships according to an embodiment of the present application.
Figure 7 is a structural block diagram of the apparatus for extracting key information from event statement text according to the second embodiment of the present application.
Figure 8 is a schematic diagram of the hardware structure of the electronic device according to the third embodiment of the present application.
Detailed Description
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of one or more embodiments of this specification, as detailed in the appended claims.
It should be noted that in other embodiments the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described here. In addition, a single step described in this specification may be split into multiple steps in other embodiments, and multiple steps described in this specification may be combined into a single step in other embodiments.
This solution can be used to organize urban management cases. Records of urban management cases are generally reported by multiple law enforcement officers on different platforms, and cases may also originate from reports or complaints by the public. Compared with traditional structured event information, urban management cases therefore have a highly flexible and variable grammatical structure. For this reason, the solution extracts key information from event statement text: it builds an extraction framework that converts unstructured natural-language process text describing case-file procedures into structured key information, so that the personnel involved can carry out subsequent processing based on that structured key information.
The solution is of course not limited to urban management cases. Law enforcement cases, questionnaires, legal documents, and the like can all use the extraction approach of this solution to obtain key information, and the solution places no restriction in this respect.
The present application is explained below using the example of applying the solution to urban management cases.
Figure 1 is a flow chart of the main steps of the method for extracting key information from event statement text according to the first embodiment of the present application.
To this end, as shown in Figure 1, the method for extracting key information from event statement text mainly includes the following steps S101 to S104.
Step S101: segment the event statement text into words and input the result into a trained classifier for recognition, to obtain action entities and object entities output by the trained classifier.
In step S101, the event statement text is unlabeled text that states how an event unfolded. Taking an urban management case as an example, as shown in Figure 2, the fact statement text of the case mainly records the facts of the violation and the harm and impact caused. The text is unstructured process text written in natural language.
The event statement text is segmented into words. Possible approaches include: a common string-matching algorithm, which follows a given strategy to compare the Chinese character string to be segmented against the entries of a dictionary one by one and cuts a word whenever a match succeeds; a common statistics-based method, which counts how often adjacent characters co-occur in a corpus, computes their mutual information, and treats character combinations whose mutual information exceeds a threshold as words; a segmentation algorithm based on hidden Markov models; or a segmentation algorithm based on conditional random fields. For example, applying any of these algorithms to the sentence "发现当事人王XX正在该处倾倒装修垃圾" ("the party Wang XX was found dumping renovation waste at the site") in Figure 2 can yield "发现/当事/人/王XX/正在/该处/倾/倒/装修/垃圾". The ten phrases obtained are input into the trained classifier, which recognizes the phrases that denote actions and the phrases that denote objects and finally outputs the action entities and object entities.
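The string-matching approach mentioned above can be illustrated with forward maximum matching, one common dictionary-based strategy. This is a sketch only: the text does not say which matching strategy is used, and the function name, the `max_len` parameter, and the toy dictionary in the usage note are assumptions.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by greedily taking the longest dictionary entry at each position.

    Characters that start no dictionary entry are emitted as single-character words.
    """
    words: List[str] = []
    i = 0
    while i < len(text):
        matched = text[i]  # fallback: a single character
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                matched = candidate
                break
        words.append(matched)
        i += len(matched)
    return words
```

With a dictionary containing "正在", "该处", "装修", and "垃圾", the fragment "正在该处倾倒装修垃圾" is cut into "正在/该处/倾/倒/装修/垃圾", which matches the segmentation quoted above for that part of the sentence.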
Specifically, referring again to Figure 2, the case information contained in natural text has two main characteristics. First, the activity of each event in a case is mainly expressed by a verb phrase in the sentence; for example, "upon investigation" and "dumping" in Figure 2 are each an activity. Second, a verb on its own does not carry complete semantics; it has to combine with the corresponding subject or object to form a meaningful event. For example, "dumping" with "renovation waste" and "driving" with "electric tricycle" in Figure 2 each form a complete event. Therefore, to obtain complete events in the end, this step first extracts the phrases that make up events from the natural text and classifies them into action entities and object entities.
For step S101, the solution segments the event statement text into a process text composed of phrases, expressed as $T = (W_1, W_2, W_3, \ldots, W_n)$, where $W_1, W_2, \ldots, W_n$ are the individual phrases. The process text is input into the trained classifier for recognition, which outputs the action entities denoting actions and the object entities denoting objects. This step builds the extraction framework for unstructured process text on the two characteristics of the case information contained in natural text: events are expressed by verb phrases, and a verb has to combine with a subject or object to form a complete event.
Step S102: create entity pairs to be identified according to the association mapping between the object entities and the action entities.
In this step, a passage of natural text may contain multiple object entities and multiple action entities. The object entities are represented as one set and the action entities as another, and an association mapping is created between the elements of the two sets. For example, with a set $A = \{a_1, a_2, \ldots, a_n\}$ storing the action entities and a set $P = \{p_1, p_2, \ldots, p_n\}$ storing the object entities, creating the association mapping means forming a1->p1, a1->p2, ..., an->pn, where "->" denotes a mapping. An association value indicates whether two entities are related, for example $f_{rel}: A \times P \to \{1, 0\}$, where "×" denotes the Cartesian product of set $A$ and set $P$: if $A$ is $(a_1, a_2)$ and $B$ is $(b_1, b_2)$, then $A \times B$ is $(a_1, b_1), (a_1, b_2), (a_2, b_1), (a_2, b_2)$. $f_{rel}(a, p) = 1$ indicates that $p$ is an attribute associated with $a$, and 0 indicates that the two are unrelated.
For step S102, the solution associates the action entities with the object entities to obtain the entity pairs to be identified. These include related pairs such as "dumping" and "renovation waste" as well as pairs that are not actually related, such as "driving" and "cardboard box". In this step, both the actually related and the actually unrelated pairs are enumerated as entity pairs to be identified; in the subsequent steps, these pairs are recognized so as to obtain the entity pairs that constitute correct events.
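The enumeration of candidate pairs as the Cartesian product $A \times P$, together with an association function in the role of $f_{rel}$, can be sketched as follows. The function names are hypothetical, and the lookup-based `f_rel` only stands in for the trained relationship recognition model used in step S103.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (action entity, object entity)


def build_candidate_pairs(actions: List[str], objects: List[str]) -> List[Pair]:
    """Enumerate A x P: every action entity paired with every object entity."""
    return list(product(actions, objects))


def make_f_rel(related: Iterable[Pair]) -> Callable[[str, str], int]:
    """Return f_rel(a, p) -> 1 if (a, p) is a known related pair, else 0."""
    related_set = set(related)

    def f_rel(a: str, p: str) -> int:
        return 1 if (a, p) in related_set else 0

    return f_rel
```

For two actions and two objects this produces four candidates, of which only the pairs marked as related map to 1.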
Step S103: extract the feature vectors of the entity pairs to be identified and input the feature vectors into a trained relationship recognition model, to obtain the target entity pairs output by the trained relationship recognition model.
In this step, the trained relationship recognition model may be a TSVM (transductive support vector machine) model. The TSVM model is trained on manually labeled entity-pair associations together with unlabeled samples, and it can recognize activity/attribute relationships. The entity pairs to be identified are input into the TSVM model, which outputs the target entity pairs that constitute correct events. That is, the model automatically discards actually unrelated entity pairs such as "driving" and "cardboard box" and outputs the recognition results for the actually related entity pairs.
In this step, the TSVM model has a conventional structure. To train the TSVM, the entity-pair data set corresponding to a set of process texts is first collected as $E = (X_1, X_2, X_3, \ldots, X_n)$, where $X_i \in \mathbb{R}^y$ ($i = 1, \ldots, n$) is the feature vector of an entity pair and $y$ is the number of features of an entity pair. Part of the entity-pair data set is labeled manually: the labeled part is $E_{lab} = (X_1, X_2, X_3, \ldots, X_l)$ with labeling result $M_{lab} = (M_1, \ldots, M_l)$, where $M_i \in \{+1, -1\}$ ($i = 1, \ldots, l$), and the unlabeled part is $E_{unlab} = (X_{l+1}, X_{l+2}, \ldots, X_n)$. The TSVM is then trained with the entity-pair data set, including the unlabeled part, as input and the labeling results as output, which yields the trained relationship recognition model.
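For reference, a TSVM is commonly trained by solving the optimization problem below, written here with the symbols $X_i$, $M_i$, $l$, and $n$ defined above. The weight vector $w$, the bias $b$, the slack variables $\xi_i$ and $\xi_j^{*}$, and the penalty constants $C$ and $C^{*}$ are standard notation assumed for illustration; they do not appear in the text, and an embodiment may use a different variant.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*},\,M_{l+1},\ldots,M_{n}} \quad
  & \frac{1}{2}\lVert w \rVert^{2}
    + C \sum_{i=1}^{l} \xi_i
    + C^{*} \sum_{j=l+1}^{n} \xi_j^{*} \\
\text{s.t.} \quad
  & M_i \,(w^{\top} X_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l, \\
  & M_j \,(w^{\top} X_j + b) \ge 1 - \xi_j^{*}, \quad \xi_j^{*} \ge 0, \quad
    M_j \in \{+1, -1\}, \quad j = l+1, \ldots, n.
\end{aligned}
```

The first sum penalizes margin violations on the manually labeled pairs, while the second does the same for the unlabeled pairs, whose labels $M_{l+1}, \ldots, M_n$ are treated as additional variables to be optimized.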
For step S103, the solution automatically recognizes the entity pairs to be identified to obtain the target entity pairs that constitute correct events. Each target entity pair contains an action entity and an object entity that are actually related.
Step S104: where there are multiple target entity pairs, sequentially map the target entity pairs according to the order in which each target entity pair appears in the event statement text, to obtain structured key information.
In this step, if there is only one target entity pair, that target entity pair is taken as the structured key information.
In step S104, the information of multiple target entity pairs is integrated by analyzing the sequential mapping of each "action-object" pair in the event statement text.
In summary, through steps S101 to S104, the solution first extracts actions and objects from the event statement text, then obtains the association mapping between them, and finally analyzes the sequential mapping of each "action-object" pair in the text, thereby extracting the key information in the event statement text.
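The sequential mapping of step S104 can be sketched as a sort by position of appearance. The tuple layout and the output field names are assumptions for illustration, and only positional order is shown; the text also allows chronological order.

```python
from typing import Dict, List, Tuple, Union

# (action entity, object entity, position of the action in the event statement text)
TargetPair = Tuple[str, str, int]


def map_in_order(pairs: List[TargetPair]) -> List[Dict[str, Union[int, str]]]:
    """Order target entity pairs by appearance and number them from 1."""
    ordered = sorted(pairs, key=lambda pair: pair[2])
    return [
        {"seq": index + 1, "action": action, "object": obj}
        for index, (action, obj, _position) in enumerate(ordered)
    ]
```

A single target entity pair passes through unchanged with sequence number 1, which corresponds to the single-pair case described above.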
The following specific example illustrates how the steps of the first embodiment are used to extract the key information in event statement text.
Referring to Figures 3 to 5, the processing includes: building the case-file information extraction framework, recognizing activity and attribute entities, recognizing the relationships between activities and attributes, and extracting the sequential relationships between activities.
Specifically, the basic elements involved in case-file information extraction are defined according to the principle that an "activity" and an "attribute" together form a complete event:
S301. Define the case-file process text. The case-file process text is obtained by segmenting the fact statement text into words and consists of a word string, expressed as $T = (W_1, W_2, W_3, \ldots, W_n)$, where $W_1, W_2, \ldots, W_n$ are the individual words.
S302. Define the length of the process text. The length of the process text is $T_{len} = n$, and a substring of the process text is $T[S, L] = (W_S, W_{S+1}, \ldots, W_{S+L-1})$, where $S$ is the start position and $L$ is the length.
S303. Define the entities in the process text. An entity in process text $T$ is $a = \langle ID, S, L \rangle$, where $ID$ is the unique identifier of the entity, $ID \ge 0$, $1 \le S \le T_{len}$, and $1 \le L \le T_{len} - S + 1$.
S304. Subdivide entities into activity entities and attribute entities. The process-text substring corresponding to an activity entity is a verb phrase, and the substring corresponding to an attribute entity is a noun phrase. The dot operator "." denotes a field of an entity; $x_i.S$ is the start position of an activity/attribute entity $x_i$ in the process text.
For example, taking the case process text of Figure 2: "dumping" (倾倒) and "driving" (驾驶) are each an activity entity, and the S value of each is its position in the text. "驾驶" is usually treated as a single word by Chinese word segmentation tools, so L = 1. If a tool splits an entity into two words, for example "倾倒" into "倾" and "倒", then L = 2. "Renovation waste" (装修垃圾) and "electric tricycle" (电动三轮车) are attribute entities, whose S and L values are determined by the segmentation result.
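The definitions in S301 to S304 map onto a small data structure. The sketch below assumes 1-based positions, as implied by the constraint $1 \le S \le T_{len}$; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    """An entity <ID, S, L>: identifier, 1-based start position, length in words."""
    id: int
    start: int
    length: int


def substring(words: List[str], start: int, length: int) -> List[str]:
    """Return T[S, L] = (W_S, ..., W_{S+L-1}) using 1-based positions."""
    t_len = len(words)
    if not 1 <= start <= t_len:
        raise ValueError("start must satisfy 1 <= S <= T_len")
    if not 1 <= length <= t_len - start + 1:
        raise ValueError("length must satisfy 1 <= L <= T_len - S + 1")
    return words[start - 1:start - 1 + length]


def entity_text(words: List[str], entity: Entity) -> List[str]:
    """Look up the words covered by an entity."""
    return substring(words, entity.start, entity.length)
```

With the segmentation "发现/当事/人/王XX/正在/该处/倾/倒/装修/垃圾", an activity entity with S = 7 and L = 2 covers "倾" and "倒", and an attribute entity with S = 9 and L = 2 covers "装修" and "垃圾".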
S305. Define the text case for the process text. The case extraction task for case-file process text is essentially to obtain the corresponding case $C$ from a process text $T$, where $C = \langle A, P, f_{rel}, f_{seq} \rangle$.
S306. Define all activities and attributes related to the case. The fragments of the process text that correspond to the individual activity entities and attribute entities do not overlap one another. $A$ is the activity set and $P$ is the attribute set, which must satisfy the condition: for
S307. Define the association mapping between activities and attributes. $f_{rel}: A \times P \to \{1, 0\}$; $f_{rel}(a, p) = 1$ indicates that $p$ is an attribute associated with $a$, and 0 indicates that the two are unrelated.
S308. Define the chained order between activities. $f_{seq}: A \to \{1, \ldots, |A|\}$, where $f_{seq}$ is a bijection, with $f_{seq}(a_1) = f_{seq}(a_2) - 1$ for consecutive activities $a_1$ and $a_2$.
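The requirement that $f_{seq}$ be a bijection onto $\{1, \ldots, |A|\}$ can be checked mechanically. In the sketch below, activities are identified by strings and $f_{seq}$ is stored as a dictionary; both choices, and the function names, are assumptions for illustration.

```python
from typing import Dict, List


def is_valid_sequence(activities: List[str], f_seq: Dict[str, int]) -> bool:
    """Check that f_seq maps the activity set one-to-one onto 1..|A|."""
    activity_set = set(activities)
    if set(f_seq) != activity_set:
        return False
    return sorted(f_seq.values()) == list(range(1, len(activity_set) + 1))


def immediately_precedes(f_seq: Dict[str, int], a1: str, a2: str) -> bool:
    """True when f_seq(a1) = f_seq(a2) - 1, i.e. a1 comes directly before a2."""
    return f_seq[a1] == f_seq[a2] - 1
```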
对活动以及属性实体进行识别,具体地,可以将活动/属性实体识别任务抽 象为一个序列标注问题,即对过程文本中的每个词附以标签,根据标签提取各 个活动和属性实体。序列标注问题可以使用条件随机场等传统的统计学习方法 解决,但是这类方法需要一定规模的标注数据,因此在应用于活动/属性实体 识别任务时存在一定局限。为此设计了一种半监督的活动/属性实体识别方法, 通过利用大量的未标注的原始过程文本弥补标注文本的不足:To identify activity and attribute entities, specifically, the task of activity/attribute entity recognition can be abstracted as a sequence labeling problem, that is, each word in the process text is labeled, and each activity and attribute entity is extracted according to the label. The sequence labeling problem can be solved by traditional statistical learning methods such as conditional random fields, but such methods require a certain scale of labeling data, so there are certain limitations when applied to activity/attribute entity recognition tasks. To this end, a semi-supervised activity/attribute entity recognition method is designed, which makes up for the lack of annotated text by utilizing a large amount of unlabeled raw process text:
S401. Manually label a small amount of process text. The goal of the sequence labeling task is, given a process text $T = (W_1, W_2, W_3, \dots, W_n)$, to obtain the corresponding label sequence $M = (M_1, M_2, M_3, \dots, M_n) \in Mark^n$, where Mark is the label space. The labeling reference is given in Table 1, which serves as the labeling standard for process text.
Table 1 - Labeling reference table
The process text is manually labeled according to the labeling reference table to obtain the labeling result shown in Table 2. Table 2 is a labeling example that records the labels for the sentence "发现当事人王XX正在该处倾倒装修垃圾" ("the party Wang XX was found dumping decoration garbage at that location") from the example in Figure 2.
Table 2 - Labeling example table
S402. Automatically pre-label a large volume of process text.
Because of time and other costs, the amount of text that can be labeled manually is limited. To make full use of the limited labeled text, the same activity and attribute entities that recur in unlabeled process text are pre-labeled automatically from the labeled text. Taking the case-file text shown in Table 2 as an example, "dumping" is a necessary action in many construction-waste violations, and "garbage" likewise appears as an attribute in such violations. These words can therefore be collected from the labeled instances to build an entity list used for pre-labeling.
For example, for a given unlabeled process text $T = (W_1, W_2, W_3, \dots, W_n)$, if $W_i$ $(i = 1, 2, 3, \dots, n)$ appears in the entity list above, it is directly labeled with the corresponding activity or attribute label $M_i$; otherwise $M_i = \text{N/A}$. Note that an important difference between pre-labeled and manually labeled process text is that they have different label spaces: the manual label space is $M_R = \{AB, AC, AS, PB, PC, PS, O\}$, whereas the pre-labeling label space is $M_D = \{AB, AC, AS, PB, PC, PS, \text{N/A}\}$. To fuse the two kinds of labeling results, the two label spaces must be unified. Because the N/A label in $M_D$ denotes an undetermined labeling result that in fact corresponds to some label in Mark, the N/A label in the pre-labeling result is replaced with …
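The dictionary-style pre-labeling of S402 can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, matching is done word by word (multi-word entities are not handled), and the assignment of `AS` and `PS` to the sample words is assumed because Table 1 is not reproduced here.

```python
UNKNOWN = "N/A"  # label for words not found in the entity list


def build_entity_list(labeled_texts):
    """Collect word -> label for every word manually labeled as an entity.

    labeled_texts: iterable of (words, labels) pairs from manual labeling,
    where the label "O" marks words that are not part of any entity.
    """
    entity_list = {}
    for words, labels in labeled_texts:
        for word, label in zip(words, labels):
            if label != "O":
                entity_list[word] = label
    return entity_list


def pre_label(words, entity_list):
    """Label each word from the entity list, or N/A when it is absent."""
    return [entity_list.get(word, UNKNOWN) for word in words]


# Assumed manual labels for one tiny labeled text
manual = [(["party", "dumping", "garbage"], ["O", "AS", "PS"])]
entity_list = build_entity_list(manual)

# Pre-label an unlabeled text with the resulting entity list
pre_labels = pre_label(["truck", "dumping", "garbage", "there"], entity_list)
```

Words outside the entity list keep the N/A label, which is why the pre-labeling label space differs from the manual one.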
S403. Train a semi-Markov conditional random field model. Let the manually labeled process text dataset be $\langle t, m \rangle$, where $t = (T_1, T_2, T_3, \dots, T_n)$ is the set of process texts and $m = (M_1, M_2, M_3, \dots, M_n)$ is the corresponding set of manual labeling results. Let the pre-labeled process text dataset be $\langle t', m' \rangle$, where $t' = (T'_1, T'_2, T'_3, \dots, T'_x)$ is the obtained set of pre-labeled process texts and $m' = (M'_1, M'_2, M'_3, \dots, M'_x)$ is the set of pre-labeling results after alignment. The model is trained with the process text set t and the pre-labeled process text set t' as input, and with the manual labeling results m and the aligned pre-labeling results m' as output.
The activity/attribute relation recognition task can be abstracted as a binary classification problem over ⟨activity, attribute⟩ entity pairs, where the classification target is whether or not a relation exists between a given entity pair. Figure 3 shows a relation recognition example representing activity-attribute relations. Referring to Figure 6, a relation exists between entity pairs such as ⟨dumping, decoration garbage⟩, and the corresponding class is +1; no relation exists between entity pairs such as ⟨being, party Wang XX⟩, and the corresponding class is -1. As with the entity recognition task, usable relation-labeled data is also lacking for domain process text. To reduce the dependence on manually labeled data and make full use of the large amount of unlabeled process text in the same domain, an ⟨activity, attribute⟩ entity pair classification method is designed on the basis of transductive inference (also known as transductive learning). It mainly comprises two steps, entity pair data generation and transductive support vector machine model training:
S501. Generate entity pair data. The first step of the activity/attribute relation recognition method based on entity pair classification is to convert the process text, after entity recognition, into a set of entity pairs, denoted A->P.
S502. Build the activity/attribute relation recognition model. A transductive support vector machine (TSVM) is trained with the unlabeled entity pair dataset as input and the labeling results as output, yielding the trained relation recognition model.
The goal of activity order extraction is to sort all recognized activities by their order of execution. Most process texts are narrative in style, and their main purpose is to record how events in a specific domain unfolded, such as an investigation process or a law enforcement process. They therefore tend to be written as a plain chronological account and rarely use special narrative devices such as flashbacks or interpolated episodes. It can thus be assumed that the order in which the activities appear in the process text essentially reflects the order in which the corresponding events occurred in the case, and sequence values are assigned to the activities in that order.
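Assigning sequence values by order of appearance can be sketched in a few lines. The function name `assign_sequence` and the representation of an activity mention as an `(activity, start_offset)` tuple are assumptions made for this illustration.

```python
def assign_sequence(activity_mentions):
    """Map each activity to its 1-based rank by position in the text.

    activity_mentions: list of (activity, start_offset) tuples, where
    start_offset is the character or token offset of the mention.
    """
    ordered = sorted(activity_mentions, key=lambda mention: mention[1])
    return {activity: rank for rank, (activity, _) in enumerate(ordered, start=1)}


# Mentions given out of order; offsets are made-up values
sequence = assign_sequence([("dumping", 30), ("driving", 5)])
```

The resulting ranks play the role of $f_{seq}$: the activity that appears first in the text receives the value 1.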
In one feasible embodiment, the training method of the trained classifier includes: acquiring a first sample set and labeling the first sample set to obtain a first activity entity set and a first attribute entity set; acquiring a second sample set and automatically pre-labeling the second sample set based on the activity entity set and the attribute entity set to obtain a second activity entity set and a second attribute entity set; and training the classifier with the first sample set and the second sample set as the input of the classifier and with the first activity entity set, the first attribute entity set, the second activity entity set and the second attribute entity set as the output of the classifier, to obtain the trained classifier.
Specifically, in this embodiment, "automatically pre-labeling the second sample set based on the activity entity set and the attribute entity set" includes: acquiring each unlabeled process text in the second sample set, where an unlabeled process text is represented as the set of phrases obtained by word segmentation of the unlabeled text; and traversing the phrases in each unlabeled process text and matching each phrase against the activity entity set and the attribute entity set. If a phrase matches the activity entity set, the phrase is labeled as an activity phrase; if a phrase matches the attribute entity set, the phrase is labeled as an attribute phrase.
In this embodiment, a semi-Markov conditional random field model is used as the classifier, and the trained classifier is obtained by training the semi-Markov conditional random field model. The specific training steps have been explained above and are not repeated here.
In one feasible embodiment, the method further includes: calculating the annotation density of the activity phrases and the attribute phrases in each unlabeled process text, and removing the activity phrases and the attribute phrases of those unlabeled process texts whose annotation density is less than a density threshold, to obtain the second activity entity set and the second attribute entity set.
Specifically, in this embodiment a density threshold is set, and activity phrases or attribute phrases whose annotation density does not meet the requirement are removed by means of the density threshold.
For example, the annotation density D(T) of a process text T is calculated from an indicator term that takes the value 1 when the condition $M_i \in \{AB, AC, AS, PB, PC, PS\}$ is satisfied and 0 otherwise. The density threshold is set to DL; if $D(T) > DL$, T is selected into the pre-labeled process text set.
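The density filter can be sketched as below. The density formula itself is not reproduced in the text, so this sketch assumes that D(T) is the fraction of tokens whose pre-label is an activity or attribute label (the sum of the indicator terms divided by the number of tokens). That normalization, and the function names, are assumptions.

```python
ENTITY_LABELS = {"AB", "AC", "AS", "PB", "PC", "PS"}


def annotation_density(labels):
    """Assumed D(T): share of tokens labeled as activity or attribute."""
    if not labels:
        return 0.0
    hits = sum(1 for label in labels if label in ENTITY_LABELS)
    return hits / len(labels)


def select_pre_labeled(pre_labeled_texts, density_threshold):
    """Keep only the (words, labels) texts with D(T) > density_threshold."""
    return [
        (words, labels)
        for words, labels in pre_labeled_texts
        if annotation_density(labels) > density_threshold
    ]


# Two made-up pre-labeled texts: one dense, one with no entity labels
dense = (["truck", "dumping", "garbage", "there"], ["N/A", "AS", "PS", "N/A"])
sparse = (["it", "was", "raining"], ["N/A", "N/A", "N/A"])
selected = select_pre_labeled([dense, sparse], density_threshold=0.4)
```

The comparison is strict, matching the condition $D(T) > DL$ stated above, so a text whose density equals the threshold is not selected.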
In the above embodiments, this solution manually labels a small number of samples, pre-labels a large number of samples based on the manual labeling results, and obtains the pre-labeled process text set by discarding texts whose annotation density does not reach the density threshold. With the semi-supervised activity/attribute entity recognition method of this solution, the shortage of labeled text can be compensated for by exploiting a large amount of unlabeled raw process text.
In one feasible embodiment, "creating entity pairs according to the association mapping between the object entities and the action entities" includes: acquiring the one-to-one association between each action entity and each object entity to obtain at least one preliminary mapped entity pair; and removing the preliminary mapped entity pairs for which the distance between the action entity and the object entity in the event statement text does not meet a preset condition, to obtain the entity pairs to be recognized.
Specifically, the main challenge in extracting entity pairs from process text is that the number of candidate entity pairs is very large. A process text containing x activity entities and y attribute entities contains x × y entity pairs, of which only a small number actually have a relation, and this makes it difficult to apply the subsequent classification algorithm effectively. Considering that activities and attributes that are far apart in the text usually have no relation, in this embodiment the preset condition is defined by a distance threshold: a distance threshold is set, and entity pairs whose action entity and object entity are farther apart in the event statement text than the distance threshold are filtered out.
In one feasible embodiment, the method for obtaining "the distance between the action entity and the object entity in the event statement text" includes: creating a chained action entity set according to a first order of appearance of the action entities in the event statement text, and recording the action start position of each action entity in the chained action entity set; creating a chained object entity set according to a second order of appearance of the object entities in the event statement text, and recording the object start position of each object entity in the chained object entity set; and calculating the difference between the action start position and the object start position and taking this difference as the distance between the action entity and the object entity in the event statement text.
Specifically, in this embodiment the chained action entity set is a set that stores the action entities linked in order by pointers, and the action start positions of the action entities in the set increase from 0. Likewise, the chained object entity set is a set that stores the object entities linked in order by pointers, and the object start positions of the object entities in the set increase from 0. For example, given A = {dumping, driving} and P = {decoration garbage, car}, the action start position of "dumping" is 0 and the object start position of "car" is 1, so the difference between the two is 1. If this difference is greater than the distance threshold, then according to $f_{rel}: A \times P \to \{1, 0\}$ the association value is 0 and the association between "dumping" and "car" is removed; if it is less than the distance threshold, the association between "dumping" and "car" is retained.
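The chain-position distance and the threshold filter can be sketched as follows. The function names are hypothetical, the distance is taken as the absolute difference of the two start positions, and a pair whose distance equals the threshold is kept; the text only specifies the greater-than and less-than cases, so both of those choices are assumptions.

```python
def chain_positions(entities_in_order):
    """Start position of each entity in its chained set, counting from 0."""
    return {entity: index for index, entity in enumerate(entities_in_order)}


def candidate_pairs(actions, objects, distance_threshold):
    """Form all action/object pairs and drop those that are too far apart."""
    action_pos = chain_positions(actions)
    object_pos = chain_positions(objects)
    pairs = []
    for action, a_index in action_pos.items():
        for obj, o_index in object_pos.items():
            distance = abs(a_index - o_index)  # assumed: absolute difference
            if distance <= distance_threshold:  # assumed: equality is kept
                pairs.append((action, obj))
    return pairs


# The example from the text: A = {dumping, driving}, P = {decoration garbage, car}
actions = ["dumping", "driving"]
objects = ["decoration garbage", "car"]
strict_pairs = candidate_pairs(actions, objects, distance_threshold=0)
loose_pairs = candidate_pairs(actions, objects, distance_threshold=1)
```

With a threshold of 0 the pair ("dumping", "car"), whose distance is 1, is removed, while a threshold of 1 retains all four candidate pairs for the later relation classifier to judge.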
In one feasible embodiment, "the order of appearance of each target entity pair in the event statement text" includes: temporal order and/or positional order.
Compared with the first embodiment, in which the structured key information is built by extracting the target entity pairs in order of appearance, this embodiment provides a variant. Specifically, temporal features such as timestamps and dates are extracted from the event statement text, and associations between the temporal features and the action entities are established. The temporal features are sorted chronologically, which in turn sorts the target entity pairs. For instance, suppose the event statement text contains: "at two o'clock in the afternoon the party drove an electric tricycle ... and at nine o'clock in the evening dumped decoration garbage in the xxx area". Then "two o'clock in the afternoon" and "nine o'clock in the evening" are extracted; "two o'clock in the afternoon" is associated with the action entity "driving" and "nine o'clock in the evening" with the action entity "dumping". The execution order of the actions is then sorted by time, giving the first target entity pair "driving - vehicle" and the second target entity pair "dumping - decoration garbage".
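Time-based ordering of target entity pairs can be sketched as below. The sketch assumes that each temporal expression has already been extracted and normalized to a `datetime.time` value and associated with its action; that normalization step (for example turning "two o'clock in the afternoon" into 14:00) is outside the scope of this illustration, and the function name is hypothetical.

```python
from datetime import time


def order_pairs_by_time(timed_pairs):
    """Sort target entity pairs by the time associated with their action.

    timed_pairs: list of (time, (action, object)) tuples.
    """
    ordered = sorted(timed_pairs, key=lambda item: item[0])
    return [pair for _, pair in ordered]


# The example from the text, supplied out of order
timed = [
    (time(21, 0), ("dumping", "decoration garbage")),  # nine in the evening
    (time(14, 0), ("driving", "vehicle")),             # two in the afternoon
]
ordered_pairs = order_pairs_by_time(timed)
```

Sorting on the time value alone keeps the input order for pairs that share the same time, because Python's sort is stable.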
In summary, the embodiments of the present application provide a method for extracting key information from event statement text, with the aim of automatically extracting key information from unstructured process text. To this end, the solution processes the text according to two characteristics of the case information carried by natural text, namely that "events are expressed as verb phrases" and that "a verb must combine with a subject or an object to form a complete event". The event text is first collected and segmented into words, and the segmented phrases are input into the trained classifier for recognition, which outputs the action entities and the object entities. Target entity pairs are then obtained by establishing the actual associations between the action entities and the object entities, the target entity pairs are sorted by their order of appearance, and the structured key information is finally obtained from the event statement text. The method converts unstructured process text into a structured set of key information, thereby assisting front-line law enforcement officers in subsequent case information retrieval and case study.
Fig. 7 is a structural block diagram of an apparatus for extracting key information from event statement text according to a second embodiment of the present application.
As shown in Fig. 7, one embodiment of the present application provides an apparatus for extracting key information from event statement text, including:
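An entity recognition module 701, configured to segment the event statement text into words and input the result into the trained classifier for recognition, to obtain the action entities and the object entities output by the trained classifier.
An entity association module 702, configured to create the entity pairs to be recognized according to the association mapping between the object entities and the action entities.
A relation recognition module 703, configured to extract the feature vectors of the entity pairs to be recognized and input the feature vectors into the trained relation recognition model, to obtain the target entity pairs output by the trained relation recognition model.
An order mapping module 704, configured to, when there are multiple target entity pairs, map the target entity pairs in sequence according to the order of appearance of each target entity pair in the event statement text, to obtain the structured key information.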
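The way the four modules hand data to one another can be sketched as a small pipeline. This is only an outline under stated assumptions: the class name `KeyInfoExtractor` and the callable signatures are hypothetical, and the stub functions below stand in for the trained classifier and the trained relation recognition model, which are not implemented here.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


class KeyInfoExtractor:
    """Chains stand-ins for modules 701 to 704 (all callables are assumed)."""

    def __init__(
        self,
        recognize: Callable[[str], Tuple[List[str], List[str]]],  # module 701
        associate: Callable[[List[str], List[str]], List[Pair]],  # module 702
        is_related: Callable[[Pair], bool],                       # module 703
        order: Callable[[str, List[Pair]], List[Pair]],           # module 704
    ):
        self.recognize = recognize
        self.associate = associate
        self.is_related = is_related
        self.order = order

    def extract(self, text: str) -> List[Pair]:
        actions, objects = self.recognize(text)
        candidates = self.associate(actions, objects)
        targets = [pair for pair in candidates if self.is_related(pair)]
        # Ordering only matters when there is more than one target pair
        if len(targets) > 1:
            targets = self.order(text, targets)
        return targets


# Stub components used purely to exercise the data flow
known_relations = {("driving", "car"), ("dumping", "garbage")}
extractor = KeyInfoExtractor(
    recognize=lambda text: (["dumping", "driving"], ["garbage", "car"]),
    associate=lambda acts, objs: [(a, o) for a in acts for o in objs],
    is_related=lambda pair: pair in known_relations,
    order=lambda text, pairs: sorted(pairs, key=lambda p: text.index(p[0])),
)
key_info = extractor.extract("the party was driving a car and dumping garbage")
```

Passing the components in as callables keeps the sketch independent of any particular model library.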
Fig. 8 is a schematic diagram of the hardware structure of an electronic device according to a third embodiment of the present application.
As shown in Fig. 8, the electronic device of one embodiment of the present application includes a memory 804 and a processor 802. The memory 804 stores a computer program, and the processor 802 is configured to run the computer program to perform the steps of any one of the above method embodiments.
Specifically, the processor 802 may include a central processing unit (CPU) or an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
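The memory 804 may include mass storage for data or instructions. By way of example and not limitation, the memory 804 may include a hard disk drive (Hard Disk Drive, HDD for short), a floppy disk drive, a solid-state drive (Solid State Drive, SSD for short), flash memory, an optical disc, a magneto-optical disc, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, the memory 804 may include removable or non-removable (fixed) media. Where appropriate, the memory 804 may be internal or external to the data processing apparatus. In particular embodiments, the memory 804 is non-volatile memory. In particular embodiments, the memory 804 includes read-only memory (Read-Only Memory, ROM for short) and random access memory (Random Access Memory, RAM for short). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (Programmable Read-Only Memory, PROM for short), erasable PROM (Erasable Programmable Read-Only Memory, EPROM for short), electrically erasable PROM (Electrically Erasable Programmable Read-Only Memory, EEPROM for short), electrically alterable ROM (Electrically Alterable Read-Only Memory, EAROM for short), flash memory (FLASH), or a combination of two or more of these. Where appropriate, the RAM may be static random access memory (Static Random-Access Memory, SRAM for short) or dynamic random access memory (Dynamic Random Access Memory, DRAM for short), where the DRAM may be fast page mode DRAM (Fast Page Mode Dynamic Random Access Memory, FPMDRAM for short), extended data out DRAM (Extended Data Out Dynamic Random Access Memory, EDODRAM for short), synchronous DRAM (Synchronous Dynamic Random-Access Memory, SDRAM for short), and so on.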
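The memory 804 may be used to store or cache various data files that need to be processed and/or used for communication, as well as the computer program instructions that may be executed by the processor 802.
The processor 802 reads and executes the computer program instructions stored in the memory 804 to implement any one of the methods for extracting key information from event statement text in the above embodiments.
Optionally, the electronic device may further include a transmission device 806 and an input/output device 808, where the transmission device 806 is connected to the processor 802 and the input/output device 808 is connected to the processor 802.
The transmission device 806 may be used to receive or send data via a network. Specific examples of the network may include a wired or wireless network provided by the communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 806 may be a radio frequency (Radio Frequency, RF for short) module, which is used to communicate with the Internet wirelessly.
The input/output device 808 is used to input or output information. In this embodiment, the input information may be event statement text and the like, and the output information may be structured key information and the like.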
Optionally, in this embodiment, the processor 802 may be configured to perform the following steps by means of the computer program:
S101. Segment the event statement text into words and input the result into the trained classifier for recognition, to obtain the action entities and the object entities output by the trained classifier.
S102. Create the entity pairs to be recognized according to the association mapping between the object entities and the action entities.
S103. Extract the feature vectors of the entity pairs to be recognized and input the feature vectors into the trained relation recognition model, to obtain the target entity pairs output by the trained relation recognition model.
S104. When there are multiple target entity pairs, map the target entity pairs in sequence according to the order of appearance of each target entity pair in the event statement text, to obtain the structured key information.
It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
In general, the various embodiments may be implemented in hardware or special-purpose circuits, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor or other computing device, although the invention is not limited thereto. Although various aspects of the invention may be shown and described as block diagrams, flowcharts, or using some other graphical representation, it should be understood that, as non-limiting examples, the blocks, apparatuses, systems, techniques or methods described herein may be implemented in hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware or controllers or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, for example in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products), including software routines, applets and/or macros, may be stored in any apparatus-readable data storage medium, and they include program instructions for performing particular tasks. A computer program product may include one or more computer-executable components that are configured to carry out the embodiments when the program is run. The one or more computer-executable components may be at least one piece of software code or a portion thereof. In this regard it should also be noted that any block of the logic flow in the figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on physical media such as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as DVDs and their data variants and CDs. The physical media are non-transitory media.
Those skilled in the art should understand that the technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features involves no contradiction, it should be regarded as falling within the scope of this specification.
The above embodiments represent only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the present application. It should be noted that those of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. The scope of protection of the present application shall therefore be determined by the appended claims.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111534803.1A (CN114372462A) | 2021-12-15 | 2021-12-15 | Method, device and application for extracting key information from event statement text |
| Publication Number | Publication Date |
|---|---|
| CN114372462A | 2022-04-19 |
| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| AD01 | Patent right deemed abandoned | Effective date of abandoning: 20250704 |