CN116737924A

Info

Publication number: CN116737924A
Application number: CN202310478699.1A
Authority: CN
Inventors: 李琴; 杨斌; 文治中; 宋黎晓
Original assignee: Baiyang Intelligent Technology Group Co ltd
Current assignee: Beijing Baiyang Chengchuang Pharmaceutical R&d Co ltd
Priority date: 2023-04-27
Filing date: 2023-04-27
Publication date: 2023-09-12
Anticipated expiration: 2043-04-27
Also published as: CN116737924B

Abstract

The invention relates to a medical text data processing method and device. The method comprises: fine-tuning the Chinese medical pre-trained model MC-BERT on collected public medical information extraction datasets to obtain a more robust language model; splitting the input text into a set of N tokens using character-granularity tokenization, constructing an N*N token span matrix, predicting the start and end positions of medical entities from the matrix, and identifying the text range corresponding to each entity; and feeding entity pairs that hold a medical relation into a distance-aware multi-relation classifier, determining the medical entity relations, and outputting a structured result. Using deep-learning-based natural language understanding, the invention lets a machine read medical texts and automatically extract large numbers of specialized medical entities and relations, which significantly improves the efficiency and quality of clinical research and supports the construction of specialty hospital databases.

Description

Translated from Chinese

A medical text data processing method and device

Technical field

The invention belongs to the field of information processing technology, and in particular relates to a method and device for processing medical text using artificial intelligence technology.

Background

Artificial intelligence (AI) refers to the intelligence exhibited by machines made by humans, usually intelligence implemented on ordinary computers. AI is divided into weak AI and strong AI. Weak AI (also called narrow AI) is generally understood as AI technology focused on solving problems in one specific field, and can be regarded as a technical tool applied in that field.

Natural language processing (NLP) is an important branch of narrow AI. It focuses on processing and using natural language and is widely used in human-computer interaction. Its scope includes information retrieval, information extraction, machine translation, text-to-speech, word segmentation, part-of-speech tagging, automatic summarization and other areas.

In practical health and medical big data applications, word segmentation and tagging can be used to analyze medical records that doctors write in natural language and to extract the patient's symptoms, diagnosis and treatment information, events and similar information. Obtaining and standardizing this information is important both for doctors' clinical research and for building applications such as AI-assisted diagnosis and treatment systems.

Medical text data contains a wealth of medical information. Structuring medical text means analyzing irregular medical texts, typified by electronic medical records and test reports, and combining them with clinical medical entity concepts so that a machine can automatically extract the key information a user wants. This information supports application scenarios such as clinical academic research, medical knowledge graph construction and clinical decision support. However, massive amounts of raw medical text are neither understandable nor computable for machines, and because such data is complex and specialized, medical researchers must spend a great deal of effort extracting useful information from it by hand. To use this data more efficiently and to extract information from medical texts accurately, a technology for structuring medical text is urgently needed.

Existing solutions mainly use joint entity-relation extraction models to recognize entities and relations in medical text. They typically model entity recognition and relation extraction jointly and share model parameters through a common encoder, directly producing entity triples that hold a relation. These solutions usually encode text with BiLSTM or a Chinese pre-trained BERT, overlooking the importance of domain transfer with medical text: a language model fine-tuned on a large medical corpus carries rich medical prior knowledge and has better feature representation ability than a pre-trained model trained on a general corpus. Second, such solutions often ignore nested medical entities. For example, "右肺占位" (right lung space-occupying lesion) is a lesion type, while the "右肺" (right lung) inside it is a body part; the two entities of different types are nested, and existing solutions fail in this case. Finally, in medical relation recognition, existing solutions are inflexible and cannot quickly customize a relation classifier for different relation schemas, which limits the extensibility of the model.

Summary of the invention

In view of the above problems in the prior art, the present invention provides a medical text structuring method and device. By using natural language understanding technology together with a medical pre-trained model and a distance-aware relation classifier, key information is accurately extracted from medical text to form structured data.

To achieve the above objective, the present invention provides a medical text structuring method comprising the following steps:

Build a training set from the obtained public medical information extraction datasets, fine-tune the Chinese medical pre-trained model MC-BERT, and complete the domain transfer of its parameters;

Based on the fine-tuned MC-BERT, tokenize the clinical medical text to obtain a token set of length N and construct an N*N span matrix, where N is a natural number; feed the tokenized medical text into MC-BERT to obtain encoding vectors; use the start and end positions in the matrix to determine the text range corresponding to each medical entity; and extract the medical entities;

Using a multi-class classifier based on a fully connected layer, classify the relation of entity pairs that hold a medical relation, and extract the medical entity relations;

Fuse the extracted medical entities and medical entity relations into a combined result.

Preferably, the public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition and Chinese medical entity relation extraction datasets, and the CCKS2020 medical named entity recognition and medical entity and attribute extraction datasets.

Preferably, the Chinese medical pre-trained model is fine-tuned as follows. All public medical information extraction datasets are sequence-labeled with the BIOES scheme, where B-Type marks the beginning of an entity, I-Type the inside of an entity, O a non-entity part, E-Type the end of an entity, and S-Type a single-character entity; Type denotes the corresponding medical entity type. When an entity of type Type-b is nested inside an entity of type Type-a, the label layers are merged: the two nested entity categories are combined pairwise to produce a new entity type label Type-a|Type-b. MC-BERT is then fine-tuned on the uniformly labeled data with named entity recognition as the learning objective, yielding a new language model after domain transfer.

Preferably, the clinical medical text data is preprocessed by cleaning it and splitting long texts. The text is tokenized with the vocabulary file that ships with the BERT model, and the resulting token set of length N is used to construct an N*N token matrix span that encodes entity labels. A matrix entry is written span[start][end] = C, where [start][end] is the start and end range of the text corresponding to a medical entity and C is the entity category; C = 0 denotes non-entity text. With the fine-tuned MC-BERT as the embedding, an entity type logit score is obtained for the text fragment corresponding to span[start][end], and a fragment whose score exceeds the threshold α is treated as a valid entity.
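
The span[start][end] = C encoding can be illustrated with a minimal sketch. The function names, the integer category ids and the label names `bod` and `dis` (borrowed from the "右肺占位" example used elsewhere in this description) are assumptions for illustration only, not the patent's implementation.

```python
from typing import Dict, List, Tuple


def build_span_matrix(
    n_tokens: int,
    entities: List[Tuple[int, int, str]],
    label2id: Dict[str, int],
) -> List[List[int]]:
    """Encode (start, end, label) annotations as an N*N matrix; 0 means non-entity."""
    span = [[0] * n_tokens for _ in range(n_tokens)]
    for start, end, label in entities:
        if not (0 <= start <= end < n_tokens):
            raise ValueError(f"invalid span ({start}, {end})")
        span[start][end] = label2id[label]
    return span


def decode_span_matrix(
    tokens: List[str],
    span: List[List[int]],
    id2label: Dict[int, str],
) -> List[Tuple[str, str]]:
    """Read entities back from the upper triangle (start <= end) of the matrix."""
    found = []
    for start in range(len(tokens)):
        for end in range(start, len(tokens)):
            category = span[start][end]
            if category != 0:
                found.append(("".join(tokens[start:end + 1]), id2label[category]))
    return found


# Hypothetical label ids; the source only states that C = 0 is non-entity.
label2id = {"bod": 1, "dis": 2}
id2label = {v: k for k, v in label2id.items()}

tokens = list("右肺占位")  # character-level tokens, N = 4
span = build_span_matrix(len(tokens), [(0, 1, "bod"), (0, 3, "dis")], label2id)
decoded = decode_span_matrix(tokens, span, id2label)
print(decoded)  # [('右肺', 'bod'), ('右肺占位', 'dis')]
```

Because each (start, end) cell is independent, the nested entities "右肺" and "右肺占位" occupy different cells and do not overwrite each other.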

Preferably, the relation between the labeled valid entities is determined by the following formula:

In the formula, M is the total number of entity relation categories, p_i is the context vector representation of the i-th entity pair, d_i is the relative distance feature vector of the i-th entity pair, and the symbol ° denotes vector concatenation.

Preferably, the context vector represented by the entity pair is:

In the formula, the first pair of vectors are the start and end feature vectors of the head entity in the i-th entity pair, and the second pair are the start and end feature vectors of the tail entity in the i-th entity pair; all of these feature vectors are taken from the token set encoding vector X_N. The method further comprises: guiding the model to learn the implicit relations between medical entity pairs by constructing positive and negative samples, so that the model only recognizes entity pairs that hold a factual medical relation.

Preferably, the relative distance feature vector of the entity pair is:

d_i = Linear(|s_i2 - e_i1|) (3)

In the formula, s_i2 and e_i1 are the feature vectors, in the BERT position embedding, of the tail entity and the head entity of the i-th entity pair respectively. The absolute value of their difference represents the relative position of the two medical entities in the pair, and Linear(·) denotes a further non-linear mapping of the pair's position vector through a fully connected layer.

Preferably, the extracted medical entities and medical entity relations are traversed and medical entities whose text is too long are removed. Entity pairs that hold a medical relation are visualized and saved in the format {head entity-medical relation, tail entity}, and stand-alone medical entities are visualized and saved in the format {entity type, entity value}.

The invention also provides a medical text structuring device, comprising:

a data preprocessing module, used to clean the input medical text;

a medical entity extraction module, which feeds the cleaned medical text into the fine-tuned natural language recognition model and extracts the text fragments corresponding to medical entities;

a medical entity relation extraction module, which uses a distance-aware relation classifier to extract the factual relations between pairs of medical entities;

a two-stage result fusion module, used to fuse the medical entities and medical entity relations and display the result.

Compared with the prior art, the advantages and positive effects of the present invention are as follows:

The medical text structuring method of the present invention emphasizes the feature extraction ability of the pre-trained language model. In view of the characteristics of the medical text structuring task, it fine-tunes the Chinese medical pre-trained model on medical information extraction datasets with named entity recognition as the entry point, achieving domain adaptation of the language model. With the fine-tuned pre-trained model, entity labels are encoded with a token span matrix, which ensures that nested entities can be recognized. The distance-aware entity relation classifier learns the contextual relation between entities, and constructing positive and negative samples ensures that the model only recognizes entity pairs that hold a factual medical relation. The two-stage result fusion outputs structured content, improving how efficiently clinical medical text data can be used.

Description of drawings

Figure 1 is a flow chart of a medical text structuring method according to an embodiment of the present invention;

Figure 2 is a structural block diagram of a medical text structuring device according to an embodiment of the present invention;

Figure 3 is a schematic diagram of the BIOES encoding scheme according to an embodiment of the present invention;

Figure 4 is a schematic diagram of token matrix entity labels according to an embodiment of the present invention.

Detailed description

Various aspects of the present invention are described in detail below with reference to the drawings and specific embodiments. The described embodiments are some, but not all, of the embodiments of the present invention. Without further recitation, elements, structures and features of one embodiment may also be beneficially combined into other embodiments.

A medical text structuring method according to an embodiment of the present invention, as shown in Figure 1, includes the following steps:

Step S1: fine-tune the Chinese medical pre-trained model MC-BERT on the collected public medical information extraction datasets with a named entity recognition task, obtaining a domain-adapted pre-trained language model. Specifically, the following is done before fine-tuning MC-BERT:

The public medical information extraction datasets are the CHIP2020 Chinese medical text named entity recognition and Chinese medical entity relation extraction datasets, and the CCKS2020 medical named entity recognition and medical entity and attribute extraction datasets.

All collected public medical information extraction datasets are sequence-labeled with the BIOES scheme, where B-Type marks the beginning of an entity, I-Type the inside of an entity, O a non-entity part, E-Type the end of an entity, and S-Type a single-character entity; Type denotes the corresponding medical entity type. The entity type labels mainly include: the specific part of the body affected (Body part), the presence or absence of obvious disease indicators (Symptom), growth and development indicators (BMI), the specific location of the affected area (direction), disease name (Disease), whether sample data exists (Sample), disease progression (Change), attribute features (Feature), incentive factors (Incentive), time (Time), and disease stage (Degree). A symptom entity type may be prefixed with a – sign to indicate that the patient does not have that symptom or sign. Relations between entities are represented as ordered pairs. The steps for obtaining symptoms and attributes with BIOES are as follows:

Apply named entity recognition and relation extraction to the collected public medical information, extract the medical entities, and mark negated symptoms;

Take the affected body part, the presence or absence of obvious disease indicators, growth and development indicators, and sample data as entities, and determine the attributes corresponding to each entity;

Based on the presence or absence of obvious disease indicators, extract the specific location of the affected area and the attribute features;

Based on the presence or absence of obvious disease indicators, extract time, sample data, disease stage, disease progression and incentive factors;

Based on the presence or absence of obvious disease indicators, extract disease progression and incentive factors;

Based on whether sample data exists, extract attribute features and incentive factors;

Merge and deduplicate the extracted entities and attributes.

In the actual labeling process, when an entity of type Type-b is nested inside an entity of type Type-a, the label layers are merged: the two nested entity categories are combined pairwise to produce a new entity type label Type-a|Type-b. For example, as shown in Figure 3, in the text "患者双肺小结节" (the patient has small nodules in both lungs), "双肺小结节" (small nodules in both lungs) is a lesion entity and "双肺" (both lungs) is a site entity. When labeling "双肺", the labels are therefore merged into "B-site|B-lesion, E-site|I-lesion".
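
The label-layer merge described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `bioes_tags` and `merge_layers` are hypothetical, the English labels `site` and `lesion` stand in for the original Chinese type names, and only two nesting layers are handled.

```python
from typing import List


def bioes_tags(n_chars: int, start: int, end: int, etype: str) -> List[str]:
    """BIOES tags for one entity covering characters start..end (inclusive)."""
    tags = ["O"] * n_chars
    if start == end:
        tags[start] = f"S-{etype}"
    else:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
        tags[end] = f"E-{etype}"
    return tags


def merge_layers(inner: List[str], outer: List[str]) -> List[str]:
    """Combine two label layers; positions tagged in both get 'inner|outer'."""
    merged = []
    for a, b in zip(inner, outer):
        if a == "O":
            merged.append(b)
        elif b == "O":
            merged.append(a)
        else:
            merged.append(f"{a}|{b}")
    return merged


text = "患者双肺小结节"
site = bioes_tags(len(text), 2, 3, "site")      # "双肺"
lesion = bioes_tags(len(text), 2, 6, "lesion")  # "双肺小结节"
merged = merge_layers(site, lesion)
for ch, tag in zip(text, merged):
    print(ch, tag)
```

For the two characters of "双肺" this yields `B-site|B-lesion` and `E-site|I-lesion`, matching the merged labels given in the example; the remaining characters of the lesion keep their single-layer tags.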

MC-BERT is obtained by training the natural language understanding model BERT on large-scale Chinese medical corpora such as Chinese medical question answering, Chinese medical encyclopedias and Chinese electronic medical records, so a great deal of medical knowledge has already been explicitly injected into the model. Fine-tuning MC-BERT on the uniformly labeled data with named entity recognition as the learning objective then yields a new language model after domain transfer that is better suited to information extraction tasks.

Step S2: preprocess the clinical medical text data by cleaning it and splitting long texts; tokenize with the vocabulary that ships with the BERT model, and use the resulting token set of length N to construct an N*N span matrix that encodes entity labels; with the fine-tuned MC-BERT as the embedding, obtain the entity type logit score of the text fragment corresponding to each span matrix entry, and treat a fragment whose score exceeds the threshold α as a valid entity.
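
The cleaning, splitting and tokenization in step S2 might look like the sketch below. It assumes a 512-character segment limit (the BERT limit cited in this embodiment), treats control characters and the Unicode replacement character as the "garbled" characters to remove, and uses a regex as a simplified stand-in for the vocabulary-based tokenizer: real WordPiece sub-word splitting needs the model's vocab.txt, which is not reproduced here. All function names are hypothetical.

```python
import re
from typing import List

MAX_LEN = 512  # segment length; assumed to match the BERT input limit

# Control characters and U+FFFD are treated as garbled input (an assumption).
_GARBLED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]")
# Runs of Latin letters or digits stay together; every other non-space
# character (including each Chinese character) becomes its own token.
_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|\S")


def clean_text(text: str) -> str:
    """Remove characters regarded as garbled."""
    return _GARBLED.sub("", text)


def split_long_text(text: str, max_len: int = MAX_LEN) -> List[str]:
    """Cut text into consecutive segments of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def simple_tokenize(text: str) -> List[str]:
    """Character-level split for Chinese; letter and digit runs kept whole."""
    return _TOKEN.findall(text)


print(simple_tokenize(clean_text("右肺\x00占位")))  # ['右', '肺', '占', '位']
print(simple_tokenize("行CT检查2020年"))  # ['行', 'CT', '检', '查', '2020', '年']
print([len(s) for s in split_long_text("a" * 1030)])  # [512, 512, 6]
```

Splitting at fixed offsets can cut an entity in half at a segment boundary; a production pipeline would more likely split at sentence punctuation, which the source does not specify.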

Specifically, the clinical medical text data is preprocessed to remove invalid garbled characters. If the text is longer than 512, the upper limit supported by BERT, the long text is cut into segments of length 512, producing multiple data segments. Using the vocab.txt file that ships with BERT, Chinese characters in the medical text are split character by character, while medical English characters and digits are split into sub-words. The resulting token set of length N is used to construct the N*N span matrix. Because the span matrix covers every possible fragment of the input text, nested entities no longer cause labeling conflicts. For example, the text "右肺占位" (right lung space-occupying lesion) shown in Figure 4 is tokenized into four tokens, giving a 4*4 token span matrix. In span[0][1] = bod, [0][1] is the start and end range of the corresponding text, i.e. "右肺" (right lung), whose entity type is "body"; in span[0][3] = dis, [0][3] is the start and end range of "右肺占位", whose entity type is "dis". All other non-entity entries are set to 0. With the fine-tuned MC-BERT as the embedding, the token set encoding vector X_N is obtained and passed through a non-linear transformation to produce two vectors; their inner product is used as the logits of the span matrix to score the entity type of the text fragment corresponding to span[start][end]. A fragment whose score exceeds the threshold α is treated as a valid entity, where α is set empirically to 0.5.
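
The inner-product scoring of spans can be sketched in NumPy. The symbols for the two transformed vectors are missing from the source, so everything beyond "non-linear transformation, inner product, threshold α = 0.5" is an assumption: the start and end projections `W_start` and `W_end`, the `tanh` non-linearity, one projection pair per entity type, and applying α directly to the raw logit. Random values stand in for the MC-BERT encoding X_N.

```python
import numpy as np


def span_logits(X: np.ndarray, W_start: np.ndarray, W_end: np.ndarray) -> np.ndarray:
    """Score every (start, end) pair for every entity type.

    X: (N, H) token encodings; W_start, W_end: (C, H, D) per-type projections.
    Returns logits of shape (C, N, N), where logits[c, s, e] is the inner
    product of the transformed start vector of token s and end vector of token e.
    """
    q = np.tanh(np.einsum("nh,chd->cnd", X, W_start))
    k = np.tanh(np.einsum("nh,chd->cnd", X, W_end))
    return np.einsum("cnd,cmd->cnm", q, k)


def valid_spans(logits: np.ndarray, alpha: float = 0.5):
    """Keep (type, start, end) triples with start <= end and score above alpha."""
    n_types, n_tokens, _ = logits.shape
    return [
        (c, s, e)
        for c in range(n_types)
        for s in range(n_tokens)
        for e in range(s, n_tokens)
        if logits[c, s, e] > alpha
    ]


rng = np.random.default_rng(0)
N, H, D, C = 4, 8, 6, 2          # tokens, hidden size, projection size, entity types
X = rng.normal(size=(N, H))      # placeholder for the MC-BERT output X_N
W_start = rng.normal(size=(C, H, D))
W_end = rng.normal(size=(C, H, D))

logits = span_logits(X, W_start, W_end)
spans = valid_spans(logits, alpha=0.5)
print(logits.shape, len(spans))
```

Only the upper triangle (start <= end) is inspected, since a span cannot end before it starts.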

Step S3: using a multi-class classifier based on a fully connected layer, classify the relation of entity pairs that hold a medical relation, and extract the medical entity relations.

Specifically, the labeled medical entities are arranged into pairs to build the training set. Entity pairs that hold a factual medical relation are defined as positive samples, and entity pairs without a medical relation are randomly sampled and defined as negative samples, so that the model only recognizes entity pairs that hold a factual medical relation. The relation between the entities of a pair is determined by the following formula:

The context vector represented by the entity pair is:

In the formula, the first pair of vectors are the start and end feature vectors of the head entity in the i-th entity pair, and the second pair are the start and end feature vectors of the tail entity in the i-th entity pair; all of these feature vectors are taken from the token set encoding vector X_N.

The relative distance feature vector of the entity pair is:

d_i = Linear(|s_i2 - e_i1|) (3)

In the formula, s_i2 and e_i1 are the feature vectors, in the BERT position embedding, of the tail entity and the head entity of the i-th entity pair respectively. The absolute value of their difference represents the relative position of the two medical entities in the pair, and Linear(·) denotes a further non-linear mapping of the pair's position vector through a fully connected layer. The mapped position vector has the same dimension as the entity pair vector, and feature fusion is completed by concatenation.
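
Formula (3) and the concatenation step can be sketched in NumPy. Since formulas (1) and (2) are not reproduced in this text, several parts are assumptions: the context vector p_i is taken as a given input, the classifier is a single fully connected layer followed by softmax over the M relation categories, and all weights are random placeholders rather than trained parameters.

```python
import numpy as np


def distance_feature(pos_emb, tail_start, head_end, W_d, b_d):
    """d_i = Linear(|s_i2 - e_i1|): map the absolute difference of two
    position-embedding rows through a fully connected layer."""
    diff = np.abs(pos_emb[tail_start] - pos_emb[head_end])
    return diff @ W_d + b_d


def classify_relation(p_i, d_i, W_c, b_c):
    """Concatenate p_i and d_i, then apply one fully connected layer and softmax
    (the softmax form is assumed, not taken from the source)."""
    fused = np.concatenate([p_i, d_i])
    logits = fused @ W_c + b_c
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()


rng = np.random.default_rng(0)
P, H, P_DIM, M = 16, 8, 12, 5    # positions, embedding size, pair-vector size, relation types
pos_emb = rng.normal(size=(P, H))            # placeholder for BERT position embeddings
W_d, b_d = rng.normal(size=(H, P_DIM)), np.zeros(P_DIM)
W_c, b_c = rng.normal(size=(2 * P_DIM, M)), np.zeros(M)

p_i = rng.normal(size=P_DIM)                 # placeholder context vector of the pair
d_i = distance_feature(pos_emb, tail_start=9, head_end=4, W_d=W_d, b_d=b_d)
probs = classify_relation(p_i, d_i, W_c, b_c)
print(d_i.shape == p_i.shape, int(probs.argmax()))
```

The distance layer projects to the same size as p_i, reflecting the statement that the mapped position vector keeps the same dimension as the entity pair vector before concatenation.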

Step S4: traverse the extracted medical entities and medical entity relations and remove medical entities whose text is too long. Entity pairs that hold a medical relation are visualized and saved in the format {head entity-medical relation, tail entity}, and stand-alone medical entities in the format {entity type, entity value}. For example, from the text "患者于2020年1月行CT检查示双肺结节" (the patient underwent a CT examination in January 2020, which showed nodules in both lungs), steps S2 and S3 extract (date, 2020年1月), (examination method, CT) and (lesion, 双肺结节). The relation "examination date" holds between "date" and "examination method" and is formatted as {CT-examination date, 2020年1月}. The "lesion" entity stands alone, has no medical relation with other entities, and is formatted as {lesion, 双肺结节}.
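
The two output formats of step S4 can be reproduced with a short sketch. The function name `fuse_results` and the `max_len` cutoff of 20 characters are assumptions: the source says over-long entities are removed but gives no limit.

```python
from typing import List, Tuple


def fuse_results(
    entities: List[Tuple[str, str]],        # (entity type, entity value)
    relations: List[Tuple[str, str, str]],  # (head value, relation, tail value)
    max_len: int = 20,                      # hypothetical cutoff for over-long entities
) -> List[str]:
    """Format related pairs and stand-alone entities as described in step S4."""
    kept = [(etype, value) for etype, value in entities if len(value) <= max_len]
    kept_values = {value for _, value in kept}

    records = []
    linked = set()
    for head, relation, tail in relations:
        # Skip relations whose head or tail entity was filtered out.
        if head in kept_values and tail in kept_values:
            records.append(f"{{{head}-{relation}, {tail}}}")
            linked.update((head, tail))
    for etype, value in kept:
        if value not in linked:
            records.append(f"{{{etype}, {value}}}")
    return records


entities = [("date", "2020年1月"), ("examination method", "CT"), ("lesion", "双肺结节")]
relations = [("CT", "examination date", "2020年1月")]
for record in fuse_results(entities, relations):
    print(record)
# {CT-examination date, 2020年1月}
# {lesion, 双肺结节}
```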

In summary, the present invention provides a medical text structuring method that automatically performs structured extraction on input medical text, obtains large numbers of specialized medical entities and relations, and significantly improves the efficiency and quality of clinical research.

Embodiment 2: referring to Figure 2, this embodiment provides a medical text structuring device. The functional modules are described in detail below:

Specifically, the medical entity extraction module uses the domain-transferred medical pre-trained model MC-BERT as the embedding and determines whether the text range indexed by each token span matrix entry is a predefined medical entity;

Specifically, the medical entity relation extraction module trains the model on the constructed positive and negative sample pairs, incorporates entity position feature vectors during learning, and uses a multi-class classifier to recognize relations between entities.
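
Constructing the positive and negative sample pairs might be sketched as below. The source states only that unrelated pairs are randomly sampled as negatives; the `neg_per_pos` ratio, the `no_relation` label and the use of ordered pairs over all entities are assumptions for illustration.

```python
import random
from itertools import permutations
from typing import Dict, List, Tuple


def build_pair_samples(
    entities: List[str],
    gold_relations: Dict[Tuple[str, str], str],  # (head, tail) -> relation label
    neg_per_pos: int = 1,                        # hypothetical negative-to-positive ratio
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Return (head, tail, label) samples: all positives plus sampled negatives."""
    rng = random.Random(seed)
    positives = [(h, t, rel) for (h, t), rel in gold_relations.items()]
    # Every ordered entity pair without an annotated relation is a negative candidate.
    candidates = [
        (h, t) for h, t in permutations(entities, 2) if (h, t) not in gold_relations
    ]
    k = min(len(candidates), neg_per_pos * len(positives))
    negatives = [(h, t, "no_relation") for h, t in rng.sample(candidates, k)]
    return positives + negatives


entities = ["CT", "2020年1月", "双肺结节"]
gold = {("CT", "2020年1月"): "examination date"}
samples = build_pair_samples(entities, gold)
print(samples[0])  # ('CT', '2020年1月', 'examination date')
```

Training with an explicit negative label lets the classifier reject entity pairs that co-occur in a sentence without holding any medical relation.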

Further, the medical text structuring device also includes a labeling module, which labels entities and relations in clinical medical text data.

The above embodiments are intended to explain the present invention rather than limit it. Any modification or change made to the present invention within its spirit and the protection scope of the claims shall fall within the protection scope of the present invention.

Claims

1.一种医疗文本数据处理方法，其特征在于，所述方法包括：1. A medical text data processing method, characterized in that the method includes:

根据获取到的公开医学信息抽取数据集构建训练集，微调中文医疗预训练模型MC-BERT，完成参数的域迁移；Build a training set based on the obtained public medical information extraction data set, fine-tune the Chinese medical pre-training model MC-BERT, and complete the domain migration of parameters;

基于微调后的MC-BERT将临床医疗文本分词后得到长度为N的词元(token)合集并构造N*N的矩阵，其中N为自然数，随后将分词后的医学文本送入MC-BERT获得编码向量，利用矩阵的位置坐标反推出医学实体所对应的文本范围，抽取医学实体；Based on the fine-tuned MC-BERT, the clinical medical text is segmented to obtain a collection of tokens with a length of N and a matrix of N*N is constructed, where N is a natural number. The segmented medical text is then sent to MC-BERT to obtain Encoding vector, using the position coordinates of the matrix to deduce the text range corresponding to the medical entity, and extracting the medical entity;

基于全连接层的多分类器，对存在医学关系的实体对进行关系判别，抽取医学实体关系；A multi-classifier based on the fully connected layer performs relationship discrimination on entity pairs with medical relationships and extracts medical entity relationships;

2.根据权利要求1所述的一种医疗文本数据处理方法，其特征在于，所述公开医学信息抽取数据集为CHIP2020中文医学文本命名实体识别、中文医学实体关系抽取数据集，CCKS2020医疗命名实体识别、医疗实体及属性抽取数据集。2. A medical text data processing method according to claim 1, characterized in that the public medical information extraction data set is CHIP2020 Chinese medical text named entity recognition, Chinese medical entity relationship extraction data set, CCKS2020 medical named entity Recognition, medical entity and attribute extraction dataset.

3. The medical text data processing method according to claim 1, characterized in that the method of fine-tuning the Chinese medical pre-trained model is: performing sequence labeling on all collected public medical information extraction datasets using the BIOES encoding scheme, where B-Type marks the beginning of an entity, I-Type marks the middle of an entity, O marks non-entity text, E-Type marks the end of an entity, S-Type marks a single-character entity, and Type is the corresponding medical entity type. Where an entity of one medical type, Type-a, has an entity of another type, Type-b, nested inside it, the label layers are merged: the two entity categories in the nesting relation are combined pairwise to produce a new entity type label Type-a|Type-b. MC-BERT is then fine-tuned on the uniformly labeled data with named entity recognition as the learning objective, yielding a new language model after domain transfer.
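The BIOES labeling with merged nested labels described in claim 3 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names `contains` and `bioes_tags` are hypothetical, entity spans are assumed to be given as inclusive character indices, and the sketch assumes one reading of the merge rule: an entity that contains a nested entity is tagged as a single span carrying the merged type Type-a|Type-b, and the nested entity receives no tags of its own.

```python
def contains(outer, inner):
    """True if span `inner` lies inside span `outer` and the two spans differ."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer[:2] != inner[:2]


def bioes_tags(length, entities):
    """Return BIOES tags for a text of `length` characters.

    entities: list of (start, end, type) tuples with inclusive indices.
    Assumption: nesting is resolved by merging type names with "|".
    """
    tags = ["O"] * length
    for ent in entities:
        # Skip entities nested inside another entity; the outer one carries them.
        if any(contains(other, ent) for other in entities):
            continue
        inner_types = [other[2] for other in entities if contains(ent, other)]
        label = "|".join([ent[2]] + inner_types)
        start, end = ent[0], ent[1]
        if start == end:
            tags[start] = f"S-{label}"      # single-character entity
        else:
            tags[start] = f"B-{label}"      # beginning of the entity
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"      # middle of the entity
            tags[end] = f"E-{label}"        # end of the entity
    return tags
```

For an 8-character text with a Type-a entity at positions 0 to 5, a nested Type-b entity at positions 0 to 3 and a single-character Type-c entity at position 7, the sketch tags positions 0 to 5 with the merged label and position 7 with S-Type-c. Two entities that cover exactly the same span are not handled by this sketch.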

4. The medical text data processing method according to claim 1, characterized in that the specific steps of extracting medical entities are: preprocessing the clinical medical text data, cleaning it and splitting long texts; tokenizing with the vocabulary file that ships with the BERT model to obtain a set of tokens of length N, and constructing an N*N span matrix used to encode entity labels, where the matrix entry span[start][end] = C, [start][end] is the start and end range of the text corresponding to a medical entity, C is the entity category, and C = 0 indicates non-entity text; and using the fine-tuned MC-BERT as the embedding to obtain the entity type logit score of the text fragment corresponding to span[start][end], a score greater than the threshold α being regarded as a valid entity.
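The span-matrix decoding in claim 4 can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation. The function name `decode_spans`, the nested-list layout `scores[start][end][c]` and the default threshold value are assumptions; in practice the scores would come from the fine-tuned MC-BERT encoder rather than being supplied by hand.

```python
def decode_spans(tokens, scores, alpha=0.5):
    """Read valid entities out of an N*N span score matrix.

    tokens: list of N tokens.
    scores: scores[start][end] is a list of per-category scores, where
            category 0 means non-entity text (hypothetical layout).
    alpha:  score threshold; only scores strictly above it count as valid.
    """
    n = len(tokens)
    entities = []
    for start in range(n):
        for end in range(start, n):          # only spans with start <= end
            class_scores = scores[start][end]
            best = max(range(len(class_scores)), key=class_scores.__getitem__)
            if best != 0 and class_scores[best] > alpha:
                text = "".join(tokens[start:end + 1])
                entities.append((start, end, best, text))
    return entities


# Hand-made scores for a 4-token text with two entity categories (1 and 2).
tokens = list("头痛三天")
n = len(tokens)
scores = [[[1.0, 0.0, 0.0] for _ in range(n)] for _ in range(n)]
scores[0][1] = [0.1, 0.9, 0.0]   # span 0..1 scores 0.9 for category 1
scores[2][3] = [0.2, 0.3, 0.4]   # best category scores 0.4, below alpha
found = decode_spans(tokens, scores, alpha=0.5)
```

Here `found` holds the single span (0, 1) with category 1, because the span (2, 3) does not exceed the threshold.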

5. The medical text data processing method according to claim 3, characterized in that the relations between the labeled valid entities are determined by the following formula:

6. The medical text data processing method according to claim 5, wherein the entity type labels used for annotation mainly include: the specific part of the affected area (Body part), whether there are obvious patient indicators (Symptom), growth and development indicators (BMI), the specific location of the affected area (direction), disease name (Disease), whether there is sampling data (Sample), disease progression (Change), attribute characteristics (Feature), stimulus factors (Incentive), time (Time), and disease stage (Degree); the annotated entity type of a symptom may be preceded by a "–" sign to indicate that the patient does not have that symptom or sign; relations between entities are expressed as ordered pairs; and the specific labeling steps are as follows:

7. The medical text data processing method according to claim 4, characterized in that the context vector represented by the entity pair is:

In the formula, the terms denote the start and end feature vectors of the head entity in the i-th entity pair and the start and end feature vectors of the tail entity in the i-th entity pair. These feature vectors are all taken from the token-set encoding vectors X_N. Positive and negative samples are constructed to guide the model in learning the implicit relations between medical entity pairs, ensuring that the model identifies only entity pairs for which a factual medical relation exists.

8. The medical text data processing method according to claim 4, characterized in that the relative distance feature vector between the entities of a pair is:

$$d_i = \mathrm{Linear}\left(\left|s_{i2} - e_{i1}\right|\right) \tag{3}$$
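Formula (3) can be worked through with a small Python sketch. The claim does not define its symbols here, so the sketch assumes that s_i2 is the start position of the second (tail) entity and e_i1 is the end position of the first (head) entity. The function names, the weight and bias values and the two-dimensional output size are likewise hypothetical; in a trained model the parameters of the linear layer would be learned.

```python
def linear(x, weight, bias):
    """Affine map from a scalar to a vector: weight[k] * x + bias[k]."""
    return [w * x + b for w, b in zip(weight, bias)]


def distance_feature(head_span, tail_span, weight, bias):
    """Relative distance feature d_i for one entity pair.

    head_span, tail_span: (start, end) token positions of the two entities.
    Assumption: e_i1 is the end of the head entity and s_i2 is the start
    of the tail entity.
    """
    e_i1 = head_span[1]
    s_i2 = tail_span[0]
    return linear(abs(s_i2 - e_i1), weight, bias)


# Head entity at tokens 0..1, tail entity at tokens 5..7: |5 - 1| = 4.
d = distance_feature((0, 1), (5, 7), weight=[0.5, -1.0], bias=[0.0, 2.0])
```

With these illustrative parameters the distance 4 maps to the vector [2.0, -2.0]. The absolute value keeps the feature well defined when the tail entity precedes the head entity in the text.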

9. The medical text data processing method according to claim 1, characterized in that the extracted medical entities and medical entity relations are traversed, medical entities whose text is too long are removed, entity pairs that have a medical relation are visualized and saved in the format {head entity-medical relation, tail entity}, and standalone medical entities are visualized and saved in the format {entity type, entity value}.
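The post-processing in claim 9 can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `structure_results`, the length limit `max_len`, the dictionary keys and the sample relation name are all hypothetical, and the claim does not specify a concrete serialization.

```python
def structure_results(entities, relations, max_len=20):
    """Filter over-long entities and build the structured output.

    entities:  list of (entity_type, entity_value) tuples.
    relations: list of (head_value, relation, tail_value) tuples.
    max_len:   hypothetical cutoff for "text too long".
    """
    # Drop entities whose text exceeds the length limit.
    kept = [(t, v) for t, v in entities if len(v) <= max_len]
    kept_values = {v for _, v in kept}

    # Keep only relations whose two entities both survived the filter.
    pairs = [
        {"head": h, "relation": r, "tail": t}
        for h, r, t in relations
        if h in kept_values and t in kept_values
    ]

    # Entities that take part in no relation are reported on their own.
    in_relation = {p["head"] for p in pairs} | {p["tail"] for p in pairs}
    standalone = [
        {"type": t, "value": v} for t, v in kept if v not in in_relation
    ]
    return {"relations": pairs, "entities": standalone}


sample = structure_results(
    entities=[
        ("Body part", "left lung"),
        ("Symptom", "nodule"),
        ("Time", "three days"),
        ("Feature", "x" * 50),          # too long, removed
    ],
    relations=[("nodule", "located_in", "left lung")],
)
```

In this sample the relation list holds the single nodule/left lung pair and the standalone list holds only the Time entity.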

10. A medical text data processing device, characterized in that:

the device executes and implements the medical text data processing method according to any one of claims 1 to 9.