CN113886591B

Movatterモバイル変換

Info

Publication number: CN113886591B
Application number: CN202111237852.9A
Authority: CN
Inventors: 侯良学; 王冠; 杨根科; 褚健; 王宏武
Original assignee: Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Current assignee: Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Priority date: 2021-10-25
Filing date: 2021-10-25
Publication date: 2025-06-27
Anticipated expiration: 2041-10-25
Also published as: CN113886591A

Abstract

Translated fromChinese

本发明公开了一种基于共指消除的语言模型预训练方法，涉及自然语言处理技术领域，包括如下步骤：S100、数据预处理，通过字符串匹配提取语料中的代词，处理工具提取所述语料中命名实体、名词短语，作为训练数据生成阶段的遮盖候选集合；S200、训练数据生成，通过mask_word模式和mask_phrase模式进行遮盖处理，分别生成mask_word训练数据和mask_phrase训练数据：S300、预训练，根据训练模式选择因子α^t自适应地切换word_learning模式或phrase_learning模式进行训练。本发明增加了对代词、短语、实体的语义训练，并且自适应切换学习模式，增强了模型的语义表示能力，更好地适用于共指消除任务。

The present invention discloses a language model pre-training method based on coreference elimination, which relates to the technical field of natural language processing and includes the following steps: S100, data pre-processing, extracting pronouns in a corpus by string matching, and extracting named entities and noun phrases in the corpus by a processing tool as a masking candidate set in a training data generation stage; S200, training data generation, masking processing is performed by mask_word mode and mask_phrase mode to generate mask_word training data and mask_phrase training data respectively; S300, pre-training, adaptively switching word_learning mode or phrase_learning mode for training according to a training mode selection factor α^t . The present invention increases semantic training for pronouns, phrases, and entities, and adaptively switches learning modes, thereby enhancing the semantic representation ability of the model and being better suitable for coreference elimination tasks.