CN118113867A


Info

Publication number: CN118113867A
Application number: CN202410131228.8A
Authority: CN
Inventors: 史卓颖; 谭天; 胡歆玥; 邹初建; 王涛
Original assignee: Hangzhou DPTech Technologies Co Ltd
Current assignee: Hangzhou DPTech Technologies Co Ltd
Priority date: 2024-01-30
Filing date: 2024-01-30
Publication date: 2024-05-31
Also published as: WO2025161305A1

Abstract

The specification provides a sensitive information identification method, a sensitive information identification device, an electronic device, and a storage medium. The method comprises: extracting an entity sequence from stream data to be identified, the entity sequence comprising at least one entity; inputting the entity sequence into a pre-trained classification model, wherein the classification model comprises a machine learning model trained on sample entity sequences labeled as to whether they contain sensitive information; and obtaining a first recognition result output by the classification model, the first recognition result indicating whether the entity sequence contains sensitive information. Rather than judging only from the perspective of a single entity, the technical solution judges whether an entity sequence formed by multiple entities in the data stream constitutes sensitive information. This identifies whether the data stream is a sensitive data stream and improves the accuracy of identifying sensitive information in data streams.

Description

Translated from Chinese

Sensitive information identification method, device, electronic device and storage medium

Technical Field

One or more embodiments of this specification relate to the field of stream data technology, and in particular to a sensitive information identification method, device, electronic device, and storage medium.

Background Art

Stream data carries data of various types and structures, which may include sensitive information. Sensitive information typically includes, but is not limited to, personal identity information, organizational information, and address information. If leaked or accessed without authorization, such information can cause serious security and privacy problems. To improve network data security, sensitive information carried in stream data needs to be identified effectively.

In the related art, whether stream data carries sensitive information is usually judged only from the perspective of a single entity, which leads to inaccurate recognition results.

Summary of the Invention

The present application provides a sensitive information identification method, the method comprising:

extracting an entity sequence from stream data to be identified, wherein the entity sequence includes at least one entity;

inputting the entity sequence into a pre-trained classification model, wherein the classification model includes a machine learning model trained on sample entity sequences labeled as to whether they contain sensitive information;

obtaining a first recognition result output by the classification model, wherein the first recognition result indicates whether the entity sequence contains sensitive information.

Optionally, extracting the entity sequence from the stream data to be identified includes:

converting the stream data to be identified into text data;

inputting the text data into a bidirectional long short-term memory-conditional random field (BiLSTM-CRF) model to obtain the entities included in the text data, as output by the BiLSTM-CRF model;

converting the entities included in the text data into the entity sequence.

Optionally, the method further includes:

if concept drift occurs in the classification model, labeling the entity sequence based on a second recognition result, wherein the second recognition result indicates whether the entity sequence contains sensitive information, and the credibility of the second recognition result is higher than that of the first recognition result;

retraining the classification model using the labeled entity sequence as a training sample.

Optionally, the method further includes:

identifying the entity sequence based on a preset rule to obtain the second recognition result;

if the second recognition result is inconsistent with the first recognition result, determining that concept drift has occurred in the classification model.

Optionally, determining that concept drift has occurred in the classification model if the second recognition result is inconsistent with the first recognition result includes:

for a plurality of identified entity sequences, if the proportion of first-category entity sequences among the plurality of entity sequences reaches a first threshold, determining that concept drift has occurred in the classification model, wherein a first-category entity sequence is one whose first recognition result is inconsistent with its second recognition result.

Optionally, labeling the entity sequence based on the second recognition result includes:

labeling each first-category entity sequence based on its second recognition result.

Optionally, the preset rule includes at least one of the following:

a first identification rule, under which an entity sequence carrying a preset keyword contains sensitive information;

a second identification rule, under which an entity sequence belonging to a preset IP address contains sensitive information;

a third identification rule, under which an entity sequence matching a preset pattern contains sensitive information.
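As an illustration only, the three preset rules could be combined as in the following minimal Python sketch. The keyword set, network range, and pattern are hypothetical placeholders, and treating "belonging to a preset IP address" as a check on a source IP supplied by the caller is an assumption; the specification does not define these details.

```python
import ipaddress
import re

# Hypothetical rule configuration; none of these values come from the specification.
PRESET_KEYWORDS = {"password", "ID card"}
PRESET_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]
PRESET_PATTERNS = [re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")]  # e-mail-like strings


def rule_based_result(entity_sequence, source_ip=None):
    """Return 1 if any preset rule marks the sequence as sensitive, else 0."""
    # First rule: an entity in the sequence carries a preset keyword.
    if any(k in entity for entity in entity_sequence for k in PRESET_KEYWORDS):
        return 1
    # Second rule (assumed reading): the sequence came from a preset IP range.
    if source_ip is not None:
        addr = ipaddress.ip_address(source_ip)
        if any(addr in net for net in PRESET_NETWORKS):
            return 1
    # Third rule: an entity in the sequence matches a preset pattern.
    if any(p.search(entity) for entity in entity_sequence for p in PRESET_PATTERNS):
        return 1
    return 0
```

The value returned here plays the role of the second recognition result; any single matching rule is enough to return 1.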

The present application also provides a sensitive information identification device, the device comprising:

an extraction unit, configured to extract an entity sequence from stream data to be identified, wherein the entity sequence includes at least one entity;

an input unit, configured to input the entity sequence into a pre-trained classification model, wherein the classification model includes a machine learning model trained on sample entity sequences labeled as to whether they contain sensitive information;

an acquisition unit, configured to obtain a first recognition result output by the classification model, wherein the first recognition result indicates whether the entity sequence contains sensitive information.

The present application also provides an electronic device, comprising a communication interface, a processor, a memory, and a bus, wherein the communication interface, the processor, and the memory are interconnected via the bus;

the memory stores machine-readable instructions, and the processor performs the above method by invoking the machine-readable instructions.

The present application also provides a machine-readable storage medium storing machine-readable instructions which, when invoked and executed by a processor, implement the above method.

In this way, the present application extracts an entity sequence including at least one entity from the stream data to be identified, inputs the entity sequence into a pre-trained classification model, and then determines whether the entity sequence contains sensitive information according to the first recognition result output by the classification model. Compared with judging sensitivity only from the perspective of a single entity, the technical solution of this specification judges whether an entity sequence formed by multiple entities in the data stream constitutes sensitive information. It thereby identifies whether the data stream is a sensitive data stream and improves the accuracy of identifying sensitive information in data streams.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of this specification more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below cover only some of the embodiments recorded in this specification; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a sensitive information identification method according to an exemplary embodiment;

FIG. 2 is a schematic flowchart of another sensitive information identification method according to an exemplary embodiment;

FIG. 3 is a schematic diagram of a sensitive information identification system according to an exemplary embodiment;

FIG. 4 is a schematic diagram of identifying an entity sequence according to preset rules, according to an exemplary embodiment;

FIG. 5 is a hardware structure diagram of an electronic device in which a sensitive information identification device is located, according to an exemplary embodiment;

FIG. 6 is a block diagram of a sensitive information identification device according to an exemplary embodiment.

Detailed Description

To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this specification without creative effort fall within the scope of protection of this specification.

It should be noted that, in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described here. In addition, a single step described in this specification may be split into multiple steps in other embodiments, and multiple steps described in this specification may be combined into a single step in other embodiments.

In the related art, whether stream data carries sensitive information is usually judged only from the perspective of a single entity, which leads to inaccurate recognition results. For example, consider stream data containing personal sensitive information: "Zhang San zhangsan@mail.com" contains two entities, "Zhang San" (a name) and "zhangsan@mail.com" (an email address). Viewed separately, neither entity is regarded as containing sensitive information, but stream data containing "Zhang San zhangsan@mail.com" does contain sensitive information.

In view of this, this specification proposes a technical solution for identifying whether stream data contains sensitive information.

In implementation, an entity sequence is first extracted from the stream data to be identified, the entity sequence including at least one entity. The entity sequence is then input into a pre-trained classification model, which includes a machine learning model trained on sample entity sequences labeled as to whether they contain sensitive information. Finally, a first recognition result output by the classification model is obtained, the first recognition result indicating whether the entity sequence contains sensitive information.

For example, suppose the text contained in the stream data to be identified is "Zhang San's ID card is 511002197****46614, and his email address is zhangsan@mail.com". The entities "Zhang San", "511002197****46614", and "zhangsan@mail.com" are extracted from the stream data, giving the entity sequence ["Zhang San", "511002197****46614", "zhangsan@mail.com"]. This entity sequence is then input into the pre-trained classification model, which is a machine learning model trained on sample entity sequences labeled as to whether they contain sensitive information. Finally, the classification model outputs the recognition result 1 for the entity sequence ["Zhang San", "511002197****46614", "zhangsan@mail.com"], which indicates that the entity sequence contains sensitive information.

In one embodiment, the output may be a probability for each class, for example "probability of 1: 90%, probability of 0: 10%", in which case the classification result is 1. A classification result of 1 means that the entity sequence contains sensitive information, and a classification result of 0 means that it does not.

Thus, in the technical solution of this specification, an entity sequence including at least one entity is extracted from the stream data to be identified, the entity sequence is input into a pre-trained classification model, and whether the entity sequence contains sensitive information is determined according to the first recognition result output by the classification model. Compared with judging sensitivity only from the perspective of a single entity, the technical solution judges whether an entity sequence formed by multiple entities in the data stream constitutes sensitive information. It thereby identifies whether the data stream is a sensitive data stream and improves the accuracy of identifying sensitive information in data streams.

The present application is described below through specific embodiments and in combination with specific application scenarios.

Please refer to FIG. 1, which is a schematic flowchart of a sensitive information identification method according to an exemplary embodiment. The method may perform the following steps:

Step 102: Extract an entity sequence from the stream data to be identified, wherein the entity sequence includes at least one entity.

For example, the entity sequence ["Zhang San", "511002197****46614", "zhangsan@mail.com", "Unit 2, Building 1, Zhongying Garden", "Software Development Department of XX Technology"] is extracted from the stream data to be identified. Each element is an entity, and the five entities represent a name, an ID number, an email address, an address, and an organization, respectively.

Streaming data refers to data streams generated continuously, at high speed, and without interruption. Unlike traditional batch data, streaming data is generated in real time and is transmitted and processed continuously, one data item at a time. It typically comes from sources such as sensors, devices, log files, and social media. Because it is real-time and flexible, it suits scenarios such as rapid response, real-time analysis, and immediate decision-making. In natural language processing, an entity usually refers to something that exists in the real world or has conceptual attributes. An entity can be a specific person, place, organization, date, or time, or an abstract concept, event, or product. Entity recognition identifies key information in text, such as person names, place names, organizations, and dates and times, which helps practitioners understand the meaning of the text and supports deeper text analysis and information extraction.

In this specification, one or more entity sequences may be extracted from the stream data to be identified. For example, if each entity sequence is limited to at most three entities, a first entity sequence ["Zhang San", "511002197****46614"] and a second entity sequence ["zhangsan@mail.com", "Unit 2, Building 1, Zhongying Garden", "Software Development Department of XX Technology"] may be extracted. Other combinations are also possible, such as a first entity sequence ["Zhang San", "zhangsan@mail.com"] and a second entity sequence ["511002197****46614", "Unit 2, Building 1, Zhongying Garden", "Software Development Department of XX Technology"]. The specific combination and the maximum number of entities per sequence can be adjusted as needed. This specification does not limit the number or specific form of the entity sequences extracted from the stream data to be identified.
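One simple way to respect a maximum sequence length is to split the extracted entities into consecutive groups. The sketch below assumes that grouping strategy purely for illustration; the function name `chunk_entities` and the default limit are hypothetical, and the specification explicitly allows other combinations.

```python
def chunk_entities(entities, max_len=3):
    """Split a list of entities into consecutive sequences of at most max_len items."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    # Step through the list in strides of max_len; the last chunk may be shorter.
    return [entities[i:i + max_len] for i in range(0, len(entities), max_len)]


entities = [
    "Zhang San",
    "511002197****46614",
    "zhangsan@mail.com",
    "Unit 2, Building 1, Zhongying Garden",
    "Software Development Department of XX Technology",
]
sequences = chunk_entities(entities, max_len=3)
```

With five entities and a limit of three, this yields one sequence of three entities followed by one sequence of two.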

Step 104: Input the entity sequence into a pre-trained classification model, wherein the classification model includes a machine learning model trained on sample entity sequences labeled as to whether they contain sensitive information.

For example, the entity sequence ["Zhang San", "511002197****46614", "zhangsan@mail.com", "Unit 2, Building 1, Zhongying Garden", "Software Development Department of XX Technology"] is input into the pre-trained classification model. The classification model is a machine learning model trained on sample entity sequences labeled as to whether they contain sensitive information. One such sample is the entity sequence ["Zhang San", "511002197****46614", "zhangsan@mail.com"] with the label 1, meaning that the sample entity sequence contains sensitive information.

The classification model may specifically be a binary classification model, that is, a machine learning model that assigns input data to one of two classes. Binary classification models are commonly used for problems such as deciding whether an email is spam, predicting whether a stock will rise or fall, or detecting whether an image contains a particular object. In this specification, the binary classification model is used to identify whether a data stream contains sensitive information. The model can be trained with a variety of machine learning algorithms, including logistic regression, support vector machines, and decision trees. During training, these algorithms learn the features and patterns of the input data and use them to classify new data. In this specification, a sample entity sequence may be labeled 1 to indicate that it contains sensitive information and 0 to indicate that it does not, or labeled "yes" to indicate that it contains sensitive information and "no" to indicate that it does not. This specification does not limit the specific form of the labels.
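Because the label form is left open, a training pipeline may want to normalize labels before fitting a model. The following is a minimal sketch under that assumption; `normalize_label` and the sample data are hypothetical and only cover the label forms mentioned above.

```python
def normalize_label(raw):
    """Map the label forms mentioned in the text (1/0, yes/no) to the integers 1 or 0."""
    mapping = {1: 1, 0: 0, "1": 1, "0": 0, "yes": 1, "no": 0}
    # Strings are compared case-insensitively and without surrounding whitespace.
    key = raw.strip().lower() if isinstance(raw, str) else raw
    if key not in mapping:
        raise ValueError(f"unsupported label: {raw!r}")
    return mapping[key]


# Hypothetical labeled sample entity sequences.
raw_samples = [
    (["Zhang San", "511002197****46614", "zhangsan@mail.com"], "yes"),
    (["weather", "sunny"], 0),
]
training_samples = [(seq, normalize_label(label)) for seq, label in raw_samples]
```

The resulting pairs of entity sequence and integer label can then be passed to whichever training algorithm is chosen.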

In a preferred embodiment, to improve the accuracy of the classification model, the XGBoost (eXtreme Gradient Boosting) algorithm may be used. XGBoost is an improved algorithm based on GBDT (Gradient Boosting Decision Tree) that refines the traditional approach to improve model accuracy and efficiency. Specifically, XGBoost introduces a regularization term that controls model complexity to prevent overfitting, uses second-order derivative information to estimate residuals more accurately, and supports a variety of loss functions to suit different application scenarios. In addition, each decision tree in XGBoost is built by gradient boosting, and each new tree is fitted to the residuals of all previous trees, so that the predictive ability of the model is optimized step by step.
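For reference, the regularization term and second-order information mentioned above correspond to the standard XGBoost objective. The notation below follows the XGBoost literature and is not defined in this specification: f_t is the tree added at round t, l is the loss function, T is the number of leaves, w_j is the weight of leaf j, I_j is the set of samples falling in leaf j, and gamma and lambda are regularization parameters.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^2 l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}

w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
G_j = \sum_{i \in I_j} g_i,
\quad
H_j = \sum_{i \in I_j} h_i
```

The first-order terms g_i play the role of the residuals in traditional GBDT, while the second-order terms h_i and the penalty Omega are the additions that distinguish XGBoost.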

Step 106: Obtain a first recognition result output by the classification model, wherein the first recognition result indicates whether the entity sequence contains sensitive information.

For example, the recognition result output by the classification model for the input entity sequence ["Zhang San", "511002197****46614", "zhangsan@mail.com", "Unit 2, Building 1, Zhongying Garden", "Software Development Department of XX Technology"] is 1, which indicates that this entity sequence contains sensitive information.

When outputting a recognition result, a classification model usually outputs either a probability of belonging to a class or a binary label. The probability expresses how likely the input data is to belong to the class: the higher the value, the more likely the model considers it. For example, the model may output a probability between 0 and 1, such as 0.75, meaning that the input data belongs to the class with a probability of 75%. A binary label states which class the input data is assigned to. It is usually obtained by comparing the probability output by the model with a threshold: if the probability is greater than or equal to the threshold, the positive label (for example, 1) is output; otherwise the negative label (for example, 0) is output. The form of the output depends on the problem and the application scenario. A probability output can be used when a finer-grained result is needed, and a binary label when only a simple classification result is needed. This specification does not limit the form of the output.
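The thresholding step described above can be sketched as follows. The function name `to_label` and the default threshold of 0.5 are assumptions for illustration; the specification does not fix a threshold value.

```python
def to_label(probability, threshold=0.5):
    """Convert a positive-class probability into a binary label (1 or 0)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Greater than or equal to the threshold counts as the positive class.
    return 1 if probability >= threshold else 0
```

Raising the threshold makes the classifier more conservative about reporting sensitive information, and lowering it makes the classifier more inclusive.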

In one illustrated embodiment, extracting the entity sequence from the stream data to be identified includes: converting the stream data to be identified into text data; inputting the text data into a bidirectional long short-term memory-conditional random field (BiLSTM-CRF) model to obtain the entities included in the text data, as output by the BiLSTM-CRF model; and converting the entities included in the text data into the entity sequence.

For example, a traffic packet used to transmit the stream data to be identified is obtained, and text data is extracted from the traffic packet, such as "Zhang San's ID card is 511002197****46614, his email address is zhangsan@mail.com, his address is Unit 2, Building 1, Zhongying Garden, and his organization is the Software Development Department of XX Technology". Entity recognition is then performed on the text data. The text data can be input into a BiLSTM-CRF (Bidirectional Long Short-Term Memory Conditional Random Field) model, which outputs the entities included in the text data: "Zhang San", "511002197****46614", "zhangsan@mail.com", "Unit 2, Building 1, Zhongying Garden", and "Software Development Department of XX Technology". These entities are then converted into the entity sequence ["Zhang San", "511002197****46614", "zhangsan@mail.com", "Unit 2, Building 1, Zhongying Garden", "Software Development Department of XX Technology"].

A network packet capture tool can be used to capture network traffic packets. Taking HTTP traffic as an example, a filter is set so that only HTTP packets are captured, the text information in the text-layer portion of each packet is extracted, and data from other protocol layers that contains no entity objects is filtered out, so that the stream data in the network traffic is converted into usable text. The BiLSTM-CRF model is a deep learning model for sequence labeling tasks. It combines two components, a BiLSTM (Bidirectional Long Short-Term Memory) network and a CRF (Conditional Random Field). By training the BiLSTM and the CRF jointly, the model automatically captures contextual information and the dependencies between the labels of the word-position tagging scheme, which yields accurate and coherent labeling results.
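A sequence labeling model such as BiLSTM-CRF emits one tag per token, and the entities are recovered by grouping those tags. The sketch below assumes the common BIO tagging scheme and hypothetical tag names (`NAME`, `EMAIL`); the specification does not state which word-position tagging scheme is used, and the model itself is not shown.

```python
def decode_bio(tokens, tags, sep=" "):
    """Group per-token BIO tags into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one first.
            if current:
                entities.append((sep.join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open entity, if any.
            if current:
                entities.append((sep.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((sep.join(current), current_type))
    return entities


tokens = ["Zhang", "San", "email", "zhangsan@mail.com"]
tags = ["B-NAME", "I-NAME", "O", "B-EMAIL"]
decoded = decode_bio(tokens, tags)
```

For character-level tagging of Chinese text, passing `sep=""` would join the characters without spaces.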

When converting the entities included in the text data into entity sequences, the entities can be arranged into various combinations according to the maximum number of entities allowed per sequence. For example, if an entity sequence may contain at most three entities, the entities "Zhang San", "511002197****46614", "zhangsan@mail.com", "Unit 2, Building 1, Zhongying Garden", and "Software Development Department of XX Technology" can be converted into two entity sequences, such as ["Zhang San", "511002197****46614"] and ["zhangsan@mail.com", "Unit 2, Building 1, Zhongying Garden", "Software Development Department of XX Technology"], or ["Zhang San", "511002197****46614", "zhangsan@mail.com"] and ["Unit 2, Building 1, Zhongying Garden", "Software Development Department of XX Technology"], among other combinations. Sensitive information identification is then performed on each of the two entity sequences separately. If either sequence is identified as containing sensitive information, the stream data from which the entities were extracted is considered to contain sensitive information. This specification does not limit the maximum number of entities per entity sequence or the rules for combining entities into entity sequences.
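The stream-level decision described above (the stream is sensitive if any of its entity sequences is sensitive) can be sketched as follows. The `classify` argument is a hypothetical stand-in for the pre-trained classification model and is assumed to return 1 or 0 for a single entity sequence.

```python
def stream_is_sensitive(entity_sequences, classify):
    """Return True if the classifier flags at least one entity sequence as sensitive."""
    # any() stops at the first sequence classified as 1.
    return any(classify(sequence) == 1 for sequence in entity_sequences)
```

Because `any` short-circuits, the remaining sequences are not classified once one sensitive sequence has been found.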

In one illustrated embodiment, the method further includes: if concept drift occurs in the classification model, marking the entity sequence based on a second recognition result, where the second recognition result indicates whether the entity sequence contains sensitive information and has higher credibility than the first recognition result; and retraining the classification model using the marked entity sequence as a training sample.

For example, when concept drift occurs in the pre-trained classification model, the entity sequences input into that model, such as ["张三","511002197****46614","zhangsan@mail.com"] and ["中赢花园1幢2单元","某某科技软件开发部"], need to be identified by a more credible means, for example on the basis of expert experience. Suppose the recognition results are both 1, meaning that both entity sequences contain sensitive information. The entity sequences are then marked according to these results, giving the entity sequences ["张三","511002197****46614","zhangsan@mail.com"] and ["中赢花园1幢2单元","某某科技软件开发部"] each marked 1, and the marked entity sequences are used as training samples to retrain the classification model.

Please refer to FIG. 2, which is a schematic flowchart of another sensitive information identification method according to an exemplary embodiment. As shown in FIG. 2, the stream data to be identified is first received and converted into text; entities are identified in the text and converted into entity sequences; the entity sequences are then input into a pre-trained classification model, which can adapt itself after concept drift occurs; finally, the recognition result output by the classification model, indicating whether the entity sequences contain sensitive information, is obtained.

In predictive analytics and machine learning, concept drift refers to the phenomenon in which the statistical properties of the target variable change over time in unforeseen ways. Over time, a machine-learning-based classification model for detecting sensitive information suffers, after a certain stage, from declining predictive ability, poor robustness, and frequent false positives and false negatives, so its prediction accuracy falls. It is therefore necessary to detect the occurrence of concept drift in the classification model and retrain the model on new samples.

To obtain new samples, the entity sequences input into the pre-trained classification model need to be identified by a more credible means and marked accordingly. More credible means include, but are not limited to, simple and direct keyword matching on the entity sequence and manual judgment based on expert experience. This specification does not limit how new entity sequence samples are obtained after concept drift occurs in the pre-trained classification model.

In one illustrated embodiment, the method further includes: identifying the entity sequence based on a preset rule to obtain the second recognition result; and, if the second recognition result is inconsistent with the first recognition result, determining that concept drift has occurred in the classification model.

For example, the entity sequences ["张三","511002197****46614","zhangsan@mail.com"] and ["中赢花园1幢2单元","某某科技软件开发部"] input into the pre-trained classification model are identified on the basis of expert experience, and the recognition results are both 1, whereas the pre-trained classification model outputs 1 and 0 for the same two entity sequences; it is then determined that concept drift has occurred in the pre-trained classification model.

The preset rule represents a relatively accurate discrimination rule, used in this embodiment to detect concept drift in the classification model. The preset rule may be manual judgment based on expert experience, or a comparison against keyword information in the entity sequence; this specification does not limit the specific form of the preset rule.

In this embodiment, the first recognition result output by the pre-trained classification model may include the recognition results of one or more entity sequences, and so may the second recognition result output by the preset rule. Inconsistency between the second and first recognition results may mean that the second recognition result of a single entity sequence differs from the first recognition result of that entity sequence; it may also mean that, across several entity sequences, the proportion of sequences whose second and first recognition results differ reaches a certain level among all results.

In one illustrated embodiment, determining that concept drift has occurred in the classification model if the second recognition result is inconsistent with the first recognition result includes: for a plurality of identified entity sequences, if the proportion of first-category entity sequences among the plurality of entity sequences reaches a first threshold, determining that concept drift has occurred in the classification model, where the first recognition result of a first-category entity sequence is inconsistent with its second recognition result.

For example, suppose the first threshold is 33%. For three identified entity sequences, ["张三","511002197****46614","zhangsan@mail.com"], ["中赢花园1幢2单元","某某科技软件开发部"], and ["晴天","性别女"], the first recognition results output by the classification model are 1, 1, and 1, meaning all three contain sensitive information, while the second recognition results obtained by applying the preset rule are 1, 1, and 0, meaning the first two contain sensitive information and the last does not. For the entity sequence ["晴天","性别女"], the first recognition result from the classification model and the second recognition result from the preset rule are therefore inconsistent. As a first-category entity sequence, it accounts for 33% (one in three) of the identified entity sequences, which reaches the preset first threshold, so it can be determined that concept drift has occurred in the classification model.
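The arithmetic of this example can be checked with a few lines. The function names `disagreement_ratio` and `drift_detected` are illustrative, and the threshold is passed as a fraction (0.33 for 33%).

```python
from typing import Sequence


def disagreement_ratio(first: Sequence[int], second: Sequence[int]) -> float:
    """Share of entity sequences whose first and second recognition results differ."""
    if len(first) != len(second) or not first:
        raise ValueError("results must be non-empty and of equal length")
    return sum(a != b for a, b in zip(first, second)) / len(first)


def drift_detected(first: Sequence[int], second: Sequence[int], first_threshold: float) -> bool:
    """Concept drift is determined once the disagreement share reaches the threshold."""
    return disagreement_ratio(first, second) >= first_threshold


first_results = [1, 1, 1]   # output of the classification model
second_results = [1, 1, 0]  # output of the preset rule
print(disagreement_ratio(first_results, second_results))    # one in three
print(drift_detected(first_results, second_results, 0.33))  # threshold reached
```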

Besides determining that concept drift has occurred when the proportion of entity sequences with inconsistent recognition results among the identified entity sequences reaches the first threshold, the determination can equivalently be based on the proportion of entity sequences whose first and second recognition results are consistent, in which case drift is determined when that proportion falls below a corresponding threshold. In addition, the difference need not be expressed as a percentage; it can also be converted into a confidence interval for the determination, or a warning of possible concept drift can be issued before drift is determined. For example, if the confidence level associated with the error rate reaches 95%, the classification model is considered to have possibly undergone concept drift; if it reaches 99%, the classification model is determined to have undergone concept drift. This specification does not limit the proportion or the specific probabilistic form used to decide concept drift from the difference results.
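One possible reading of the 95% and 99% levels is a one-sided test of the observed disagreement rate against a baseline rate measured while the model was still stable. The sketch below follows that reading; it is an interpretation, not something the specification prescribes. The function `drift_status`, the `baseline_rate` parameter, and the use of a normal approximation are all assumptions.

```python
import math

# One-sided critical values of the standard normal distribution.
Z_WARNING = 1.645  # 95% confidence: possible concept drift
Z_DRIFT = 2.326    # 99% confidence: concept drift determined


def drift_status(disagreements: int, total: int, baseline_rate: float) -> str:
    """Classify the current window as 'stable', 'warning', or 'drift'.

    baseline_rate is the disagreement rate observed while the model was
    considered stable (an assumed input, strictly between 0 and 1).
    """
    if total <= 0 or not 0.0 < baseline_rate < 1.0:
        raise ValueError("total must be positive and baseline_rate in (0, 1)")
    observed = disagreements / total
    # Standard error of a proportion under the baseline rate.
    std_err = math.sqrt(baseline_rate * (1.0 - baseline_rate) / total)
    z = (observed - baseline_rate) / std_err
    if z >= Z_DRIFT:
        return "drift"
    if z >= Z_WARNING:
        return "warning"
    return "stable"


for errors in (10, 16, 20):
    print(errors, drift_status(errors, total=200, baseline_rate=0.05))
```

The normal approximation is only reasonable for windows of a few hundred sequences or more; small windows would call for an exact binomial test instead.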

In one illustrated embodiment, marking the entity sequence based on the second recognition result includes: marking the first-category entity sequences based on the second recognition results of the first-category entity sequences.

For example, please refer to FIG. 3, which is a schematic diagram of a sensitive information identification system according to an exemplary embodiment. As shown in FIG. 3, after the classification model outputs its recognition result for an entity sequence (the first recognition result), the preset rule identifies the same entity sequence a second time to obtain the second recognition result. If comparing the first and second recognition results leads to the conclusion that concept drift has occurred in the classification model, the entity sequence data set that was input into the drifted classification model is marked using the second recognition results obtained from the preset rule data set. The classification model then undergoes active learning on the marked entity sequence data set together with other data sets, such as industry-standard data sets, so that it adapts to the new training samples.

The preset rule data set may include an expert experience data set, that is, manual rules by which experts judge sensitive information. The other data sets may be other trustworthy data sets, such as public data sets and academic research data sets. This specification does not limit the specific form of the preset rule or of the other data sets.

In this embodiment, to improve the adaptive training of the classification model, so that the retrained model identifies sensitive information in entity sequences more accurately, all first-category entity sequences, that is, those whose first recognition result is inconsistent with the second recognition result, can be marked according to the second recognition result.
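Selecting and marking the first-category entity sequences amounts to a filter over paired results. In this sketch the function name `mark_first_category` is hypothetical, and the retraining step itself is omitted because the specification does not fix the model type.

```python
from typing import List, Sequence, Tuple


def mark_first_category(
    sequences: Sequence[Sequence[str]],
    first_results: Sequence[int],
    second_results: Sequence[int],
) -> List[Tuple[List[str], int]]:
    """Return (entity_sequence, label) pairs for sequences whose two results differ.

    The label is the second recognition result, which is treated as the
    more credible of the two.
    """
    return [
        (list(seq), second)
        for seq, first, second in zip(sequences, first_results, second_results)
        if first != second
    ]


sequences = [
    ["张三", "511002197****46614", "zhangsan@mail.com"],
    ["中赢花园1幢2单元", "某某科技软件开发部"],
    ["晴天", "性别女"],
]
samples = mark_first_category(sequences, [1, 1, 1], [1, 1, 0])
print(samples)  # only the sequence on which model and rule disagree
```

The resulting pairs would then be used as training samples when the classification model is retrained.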

In this embodiment, to reduce the marking workload, it is also possible to mark only the entity sequences that were input into the classification model at the time concept drift was confirmed and within a preceding window bounded by a threshold. Specifically, the entity sequences input into the pre-trained classification model after the proportion of differences between the first and second recognition results reaches a second threshold can be marked, where the second threshold is smaller than the first threshold; for example, the second threshold is 20% and the first threshold is 33%.
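The two-threshold window can be sketched with a running disagreement ratio. This is one plausible reading under stated assumptions: the ratio is cumulative over all sequences seen so far, collection starts when it first reaches the second threshold, and it stops when the first threshold confirms drift. The function name `collect_marking_window` is hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Record = Tuple[Sequence[str], int, int]  # (entity sequence, first result, second result)


def collect_marking_window(
    records: Iterable[Record],
    first_threshold: float = 0.33,
    second_threshold: float = 0.20,
) -> Tuple[List[Tuple[List[str], int]], bool]:
    """Collect sequences to mark between the second and the first threshold.

    Returns the marked window and whether drift was confirmed.
    """
    if not second_threshold < first_threshold:
        raise ValueError("second_threshold must be smaller than first_threshold")
    window: List[Tuple[List[str], int]] = []
    collecting = False
    seen = 0
    disagreements = 0
    for sequence, first, second in records:
        seen += 1
        disagreements += int(first != second)
        ratio = disagreements / seen
        if ratio >= second_threshold:
            collecting = True  # lower threshold reached: start keeping sequences
        if collecting:
            window.append((list(sequence), second))
        if ratio >= first_threshold:
            return window, True  # drift confirmed
    return window, False


stream = [
    (["a"], 1, 1), (["b"], 1, 1), (["c"], 0, 0), (["d"], 1, 1),
    (["e"], 1, 0), (["f"], 0, 1), (["g"], 1, 0),
]
print(collect_marking_window(stream))
```

A cumulative ratio is unstable for the first few sequences, so a practical version would likely require a minimum number of observations before either threshold is applied.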

In one illustrated embodiment, the preset rule includes at least one of the following: a first identification rule, under which an entity sequence carrying a preset keyword contains sensitive information; a second identification rule, under which an entity sequence belonging to a preset IP address contains sensitive information; and a third identification rule, under which an entity sequence matching a preset pattern contains sensitive information.

For example, please refer to FIG. 4, which is a schematic diagram of identifying an entity sequence according to preset rules in an exemplary embodiment. As shown in FIG. 4, the entity sequence is input into a model obtained by modeling the preset rules. The preset rules include a first identification rule, which judges whether the entity sequence contains sensitive information according to whether it carries a preset keyword; a second identification rule, which judges whether the entity sequence contains sensitive information according to whether it belongs to a preset IP address; and a third identification rule, which judges whether the entity sequence contains sensitive information according to whether it matches a preset pattern. The preset rule model then outputs the second recognition result, indicating whether the entity sequence contains sensitive information. Once the preset rules for entity sequences have been decided, they are implemented in code; the implemented rules can identify whether an entity sequence contains sensitive information, yielding the second recognition result for that entity sequence.

Keywords can be judged against a predefined list of sensitive words, which are usually associated with sensitive information or sensitive topics. For example, in financial-industry scenarios, words such as bank, account, loan, and personal information may be defined as sensitive words, as may bank account names and passwords. To identify whether an entity sequence belongs to a preset IP address, regular-expression matching can be used; generally speaking, an IP address is invalid if any of its segments is greater than 255. Among entity sequences matching a preset pattern, those that may contain sensitive information include, but are not limited to: 1. names of sensitive organizations or institutions; 2. address or location information involving privacy protection, such as home addresses, office addresses, and sensitive places; 3. sensitive information such as personal ID numbers and passport numbers; 4. bank card numbers or payment accounts; 5. names of diseases or drugs involving privacy protection; 6. sensitive information such as dates of birth, family member information, and salary income. A considerable part of this information follows a fixed pattern. For example, in an 18-digit ID number the first 6 digits are the address code, digits 7 to 14 are the date-of-birth code, digits 15 to 17 are the sequence code, and digit 18 is the check code; together they form a person's unique identifier.
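The three identification rules can be sketched in code roughly as follows. The keyword list, the preset IP set, and all function names are hypothetical placeholders. The check-code computation uses the weights and code table commonly documented for 18-digit resident ID numbers (GB 11643-1999), which the specification itself does not spell out, and the sample number is the widely quoted illustrative value for that standard.

```python
import re
from typing import Sequence

SENSITIVE_KEYWORDS = ("银行", "账户", "贷款", "密码")  # hypothetical keyword list
PRESET_IPS = {"10.0.0.8"}                              # hypothetical preset addresses

IPV4_RE = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")
ID_RE = re.compile(r"[0-9]{17}[0-9Xx]")
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CODES = "10X98765432"  # indexed by (weighted sum mod 11)


def is_valid_ipv4(text: str) -> bool:
    """Dotted quad in which no segment is greater than 255."""
    match = IPV4_RE.fullmatch(text)
    return bool(match) and all(int(part) <= 255 for part in match.groups())


def is_valid_id_number(text: str) -> bool:
    """18-character ID number whose last character matches the check code."""
    if not ID_RE.fullmatch(text):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(text[:17], ID_WEIGHTS))
    return ID_CHECK_CODES[total % 11] == text[17].upper()


def second_recognition_result(sequence: Sequence[str]) -> int:
    """Return 1 if any entity triggers one of the three rules, otherwise 0."""
    for entity in sequence:
        if any(keyword in entity for keyword in SENSITIVE_KEYWORDS):
            return 1  # first rule: preset keyword
        if is_valid_ipv4(entity) and entity in PRESET_IPS:
            return 1  # second rule: preset IP address
        if is_valid_id_number(entity):
            return 1  # third rule: preset pattern (ID number)
    return 0


print(second_recognition_result(["晴天", "性别女"]))
print(second_recognition_result(["张三", "11010519491231002X"]))
```

The sketch validates only the length, character set, and check code of an ID number; a fuller rule would also verify that digits 7 to 14 form a real calendar date.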

Corresponding to the above embodiments of the sensitive information identification method, this specification also provides embodiments of a sensitive information identification device.

Please refer to FIG. 5, which is a hardware structure diagram of an electronic device hosting a sensitive information identification device according to an exemplary embodiment. At the hardware level, the device includes a processor 502, an internal bus 504, a network interface 506, a memory 508, and a non-volatile storage 510, and may of course include other required hardware. One or more embodiments of this specification can be implemented in software, for example by the processor 502 reading the corresponding computer program from the non-volatile storage 510 into the memory 508 and then running it. Of course, besides a software implementation, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logical units and may also be hardware or logic devices.

Please refer to FIG. 6, which is a block diagram of a sensitive information identification device according to an exemplary embodiment. The device can be applied to the electronic device shown in FIG. 5 to implement the technical solution of this specification. The device may include:

an extraction unit 602, configured to extract an entity sequence from stream data to be identified, the entity sequence including at least one entity;

an input unit 604, configured to input the entity sequence into a pre-trained classification model, the classification model including a machine learning model trained on sample entity sequences marked as to whether they contain sensitive information;

an acquisition unit 606, configured to acquire a first recognition result output by the classification model, the first recognition result indicating whether the entity sequence contains sensitive information.

In this embodiment, the extraction unit includes:

a first conversion subunit, configured to convert the stream data to be identified into text data;

an input subunit, configured to input the text data into a bidirectional long short-term memory-conditional random field (BiLSTM-CRF) model to obtain the entities included in the text data as output by the BiLSTM-CRF model;

a second conversion subunit, configured to convert the entities included in the text data into the entity sequence.

In this embodiment, the device further includes:

a marking unit, configured to mark the entity sequence based on a second recognition result when concept drift occurs in the classification model, where the second recognition result indicates whether the entity sequence contains sensitive information and has higher credibility than the first recognition result;

a training unit, configured to retrain the classification model using the marked entity sequence as a training sample;

an identification unit, configured to identify the entity sequence based on a preset rule to obtain the second recognition result;

a determining unit, configured to determine that concept drift has occurred in the classification model when the second recognition result is inconsistent with the first recognition result.

In this embodiment, the determining unit includes:

a determining subunit, configured to determine, for a plurality of identified entity sequences, that concept drift has occurred in the classification model if the proportion of first-category entity sequences among the plurality of entity sequences reaches a first threshold, where the first recognition result of a first-category entity sequence is inconsistent with its second recognition result.

In this embodiment, the marking unit includes:

a marking subunit, configured to mark the first-category entity sequences based on the second recognition results of the first-category entity sequences.

In this embodiment, the preset rule includes at least one of the following:

For the implementation of the functions and roles of each unit in the above device, see the implementation of the corresponding steps in the above method; details are not repeated here.

Since the device embodiments basically correspond to the method embodiments, the relevant parts of the method embodiments may be referred to. The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this specification. Those of ordinary skill in the art can understand and implement it without creative effort.

The systems, devices, modules, or units described in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-permanent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The user information (including but not limited to user device information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in this application are all information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for users to grant or refuse authorization.

It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.

Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.

The terms used in one or more embodiments of this specification are for the purpose of describing particular embodiments only and are not intended to limit one or more embodiments of this specification. The singular forms "a", "said", and "the" used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, and so on may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "at the time of", or "in response to determining".

The above are merely preferred embodiments of one or more embodiments of this specification and are not intended to limit them. Any modification, equivalent replacement, or improvement made within the spirit and principles of one or more embodiments of this specification shall fall within the scope of protection of one or more embodiments of this specification.

Claims

1.一种敏感信息识别方法，其特征在于，所述方法包括：1. A method for identifying sensitive information, characterized in that the method comprises:

2. The method according to claim 1, characterized in that extracting the entity sequence from the stream data to be identified comprises:

3. The method according to claim 1, characterized in that the method further comprises:

using the labeled entity sequence as a training sample to retrain the classification model.

4. The method according to claim 3, characterized in that the method further comprises:

5. The method according to claim 4, characterized in that determining that concept drift has occurred in the classification model if the second recognition result is inconsistent with the first recognition result comprises:

for a plurality of identified entity sequences, if the proportion of first-category entity sequences among the plurality of entity sequences reaches a first threshold, determining that concept drift has occurred in the classification model, wherein, for each first-category entity sequence, the first recognition result of that entity sequence is inconsistent with its second recognition result.
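
The proportion test described in claim 5 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `RecognitionPair` and `concept_drift_detected` are hypothetical, the recognition results are assumed to be booleans, and returning no drift for an empty batch is an assumption the claim does not address. The threshold values in the usage note are illustrative only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RecognitionPair:
    """Both results for one already-identified entity sequence (hypothetical structure)."""
    first_result: bool   # first recognition result, output by the classification model
    second_result: bool  # second recognition result (how it is obtained is not modeled here)


def concept_drift_detected(pairs: Sequence[RecognitionPair],
                           first_threshold: float) -> bool:
    """Return True when the share of inconsistent sequences reaches the threshold.

    A "first-category" entity sequence is one whose first and second
    recognition results disagree.
    """
    if not pairs:
        # Assumption: with no identified sequences there is no evidence of drift.
        return False
    first_category = sum(1 for p in pairs if p.first_result != p.second_result)
    return first_category / len(pairs) >= first_threshold
```

For example, if two of four identified entity sequences have disagreeing results, the proportion is 0.5, so `concept_drift_detected(pairs, 0.5)` reports drift while `concept_drift_detected(pairs, 0.75)` does not. The comparison uses `>=` because the claim speaks of the proportion "reaching" the first threshold.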

6. The method according to claim 3, characterized in that labeling the entity sequence based on the second recognition result comprises:

7. The method according to claim 4, characterized in that the preset rule comprises at least one of the following:

8. A sensitive information identification device, characterized in that the device comprises:

9. An electronic device, comprising a communication interface, a processor, a memory, and a bus, wherein the communication interface, the processor, and the memory are interconnected via the bus;

the memory stores machine-readable instructions, and the processor executes the method according to any one of claims 1 to 7 by invoking the machine-readable instructions.

10. A machine-readable storage medium storing machine-readable instructions, wherein the machine-readable instructions, when invoked and executed by a processor, implement the method according to any one of claims 1 to 7.