CN117131159A - Method, device, equipment and storage medium for extracting sensitive information - Google Patents

Method, device, equipment and storage medium for extracting sensitive information

Info

Publication number
CN117131159A
Authority
CN
China
Prior art keywords
sensitive information
information
text
extraction
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202311108184.9A
Other languages
Chinese (zh)
Inventor
郭大勇
欧阳奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tongban Information Service Co ltd
Original Assignee
Shanghai Tongban Information Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tongban Information Service Co ltd
Priority to CN202311108184.9A
Publication of CN117131159A
Legal status: Withdrawn (current)

Abstract

The application discloses a method, a device, equipment and a storage medium for extracting sensitive information in the technical field of information processing, wherein the method comprises the following steps: acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with the sensitive information it contains; training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information; inputting a text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result, so as to realize the extraction of the sensitive information. According to the method, after the model is trained with text data marked with sensitive information, the trained model can be used to extract sensitive information; the method is therefore applicable to different scenes or requirements, has universality, does not require a program to be written before each extraction, and can effectively improve the extraction efficiency of sensitive information.

Description

Method, device, equipment and storage medium for extracting sensitive information
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting sensitive information.
Background
With the rapid development of information technology, the files that related institutions need to inspect have become increasingly numerous, and a large amount of sensitive information accumulates continuously; with the development of society, data desensitization has become increasingly important.
For the extraction of sensitive information, programmers currently write scattered programs to process files for specific scenes or specific requirements; the variability is relatively large and no fixed standard has been formed. Moreover, every time sensitive information is to be extracted, a program must first be written and the files can only be processed based on the written program, so the extraction efficiency of sensitive information is low.
Disclosure of Invention
The application aims to provide a method, a device, equipment and a storage medium for extracting sensitive information, which can be suitable for different scenes or requirements, have universality and can effectively improve the extraction efficiency of the sensitive information.
In order to achieve the above object, the present application provides the following technical solutions:
a method of extracting sensitive information, comprising:
acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with sensitive information contained in the labeling text;
training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information;
inputting the text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result to realize the extraction of the sensitive information.
Preferably, before training the preset model by using the data set, the method further includes:
determining sensitive information marked in a marked text as appointed information;
determining other sensitive information with similarity reaching a similarity threshold value with the specified information, and replacing the specified information by the other sensitive information to obtain a new labeling text;
and adding the obtained new labeling text into the data set.
Preferably, before determining other sensitive information of which the similarity with the specified information reaches the similarity threshold, the method further comprises:
and respectively acquiring word vectors of the specified information and any other sensitive information, and calculating the similarity between the two word vectors as the similarity between the two corresponding information.
Preferably, before training the preset model by using the data set, the method further includes:
and replacing a random named entity in the random labeling text by using other words with the same properties for any labeling text in the data set, and adding the new labeling text obtained after replacement into the data set.
Preferably, before inputting the text to be extracted without marked sensitive information into the information extraction model, the method further comprises:
acquiring a target file and judging whether the target file is a document in a preset format;
if yes, reading the target file according to the line, and taking the text read each time as the text to be extracted respectively; and if not, filtering the target file.
Preferably, after determining the sensitive information in the text to be extracted based on the extraction result, the method further includes:
and counting the quantity, the number of rows and the proportion of the sensitive information in the target file, and outputting the counting result in a structuring way.
Preferably, the preset model is a BERT+BiLSTM+CRF model.
An apparatus for extracting sensitive information, comprising:
an acquisition module for: acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with sensitive information contained in the labeling text;
training module for: training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information;
an extraction module for: inputting the text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result to realize the extraction of the sensitive information.
An apparatus for extracting sensitive information, the apparatus comprising a memory and a processor, wherein a program is stored on the memory which, when executed by the processor, performs the steps of the method of extracting sensitive information described above.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of extracting sensitive information described above.
The application provides a method, a device, equipment and a storage medium for extracting sensitive information, wherein the method comprises the following steps: acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with the sensitive information it contains; training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information; inputting a text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result, so as to realize the extraction of the sensitive information. According to the technical scheme, a model is trained with text data marked with sensitive information to obtain an extraction model for extracting sensitive information; when needed, unlabeled text is input directly into the extraction model, and the sensitive information in the unlabeled text is extracted based on the result output by the extraction model, achieving effective extraction of the sensitive information. Therefore, after the model is trained with text data marked with sensitive information, the trained model can be used to extract sensitive information. Unlike the prior art, in which programmers write scattered programs for specific scenes or specific requirements, the method is suitable for different scenes or requirements and has universality; and unlike the prior art, in which a program must be written before each extraction, the method can be used directly to extract sensitive information, thereby effectively improving the extraction efficiency of sensitive information.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for extracting sensitive information according to an embodiment of the present application;
FIG. 2 is a diagram of a model structure in a method for extracting sensitive information according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating preprocessing in a method for extracting sensitive information according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a device for extracting sensitive information according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, a flowchart of a method for extracting sensitive information according to an embodiment of the present application may specifically include:
s11: and acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with the sensitive information contained in the labeling text.
The sensitive information may be sensitive words determined according to actual needs, such as position information, certificate information and the like. The text in the embodiment of the application may be Word text, and the corresponding data set is then a Word data set; other formats such as txt text may also be used according to actual needs. Word text is taken as the example in the following description.
The embodiment of the application can label the Chinese, English, Pinyin, sentences and the like of the sensitive information in the text by adopting BIO labeling, so that labeled text marked with the sensitive information it contains can be obtained. In particular, sentences or phrases for model training may be prepared, each containing a feature ner_label which represents the sensitive information entities in the sentence or phrase, i.e. the BIO-tagged entities; the sentences or phrases marked with the sensitive information they contain are the labeled texts, so that a data set containing a plurality of labeled texts is obtained. In the BIO notation, B represents the beginning (Begin) of a sensitive information entity, I represents the interior (Inside) of a sensitive information entity, and O represents a non-entity part (Outside). For example, for the sentence "我居住在朝阳区" ("I live in Chaoyang District"), where "朝阳区" (Chaoyang District) is the sensitive information, the BIO labeling is: O O O O B I I.
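As an illustration only, the character-level BIO labeling described above may be sketched as follows; the helper name and the (start, end) span format are assumptions for the example and are not defined by the application:

# Minimal sketch: character-level BIO tagging of one annotated sentence.
def bio_tags(sentence, spans):
    """Return one BIO tag per character; each span is (start, end) with end exclusive."""
    tags = ["O"] * len(sentence)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

sentence = "我居住在朝阳区"                      # "I live in Chaoyang District"
print("".join(bio_tags(sentence, [(4, 7)])))     # OOOOBII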
S12: training a preset model by using the data set to obtain an information extraction model for extracting sensitive information.
Training the preset model by using the data set in step S11, an information extraction model for extracting the sensitive information may be obtained. The preset model may be a model selected in advance according to actual needs, such as a neural network model, a self-encoder model, and the like, which is not limited herein.
S13: inputting the text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result to realize the extraction of the sensitive information.
After the information extraction model for extracting sensitive information is obtained, if the sensitive information in a certain text needs to be extracted, the text can be input into the information extraction model, which outputs a result marking the sensitive information in the text; the sensitive information in the text can then be located directly based on this result, so that the sensitive information is rapidly extracted.
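As an illustration only, the predicted tag sequence can be converted back into the extracted sensitive strings roughly as follows; the helper name is an assumption, not part of the application:

# Minimal sketch: collapse a predicted BIO tag sequence into the marked substrings.
def tags_to_spans(text, tags):
    """Return the sensitive substrings marked by a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append(text[start:i])
            start = i
        elif tag == "O" and start is not None:
            spans.append(text[start:i])
            start = None
    if start is not None:
        spans.append(text[start:])
    return spans

print(tags_to_spans("我居住在朝阳区", list("OOOOBII")))   # ['朝阳区']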
According to the technical scheme, the text data training model marked with the sensitive information is utilized to obtain the extraction model for extracting the sensitive information, unlabeled text is directly input into the extraction model when needed, the sensitive information in the unlabeled text is extracted based on the result output by the extraction model, and effective extraction of the sensitive information is achieved. Therefore, after the model is trained by using the text data marked with the sensitive information, the extraction of the sensitive information can be realized by using the model obtained by training, and the method is different from the prior art that programmers write scattered programs aiming at specific scenes or specific requirements, can be suitable for different scenes or requirements, has universality, is different from the prior art that the programs are written before each extraction, and can be directly used for extracting the sensitive information, thereby effectively improving the extraction efficiency of the sensitive information.
After the data set is acquired and before model training is performed by using the data set, the embodiment of the application can also perform data enhancement processing on the data set so as to increase the diversity of training data.
In a specific implementation manner, before training the preset model by using the data set, the method may further include:
determining sensitive information marked in a marked text as appointed information;
determining other sensitive information with similarity reaching a similarity threshold value with the specified information, and replacing the specified information by the other sensitive information to obtain a new labeling text;
and adding the obtained new labeling text into the data set.
Before determining the other sensitive information whose similarity with the specified information reaches the similarity threshold, the method may further include:
and respectively acquiring word vectors of the specified information and any other sensitive information, and calculating the similarity between the two word vectors as the similarity between the two corresponding information.
This embodiment is a data enhancement process based on word vectors. The similarity threshold may be set according to actual needs: if the similarity between two pieces of information reaches the similarity threshold, the two pieces of information are considered highly similar; otherwise, they are considered dissimilar. The BERT vector of a piece of sensitive information may be used as its word vector.
Specifically, all possible sensitive information may be collected in advance, and the similarity between the collected pieces of sensitive information may then be calculated using existing BERT vectors. Pieces of sensitive information with higher similarity are selected as a candidate word set, so that at least one candidate word set is obtained, where the sensitive information contained in a single candidate word set consists of mutually similar sensitive words. Sensitive information that appears both in a candidate word set and in a labeled text is used as the specified information, and the other sensitive information in the candidate word set where the specified information is located is used in turn to replace the specified information in the labeled text, so that new labeled texts are obtained, each containing one of the other sensitive words. After the candidate word set is obtained and the specified information is determined, only the n (for example 3 or 2) other sensitive words with the highest similarity to the specified information in its candidate word set may be used to replace the specified information in the labeled text, so that new labeled texts are obtained, each containing one of those n other sensitive words. In other words, the sensitive information in the labeled text to be expanded is determined as the specified information, the similarity between all possible sensitive information collected in advance and the specified information is calculated, the sensitive information with higher similarity is selected as the candidate word set, and the specified information in the labeled text is replaced in turn by the n most similar sensitive words in the manner described above, so as to obtain the corresponding new labeled texts.
Because the sensitive information used for expanding the data set is highly similar to the sensitive information in the labeled text, the diversity of the training data can be effectively increased without changing the semantics of the labeled text. For example, for the sensitive information "朝阳区" (Chaoyang District) in the sentence "我居住在朝阳区" ("I live in Chaoyang District"), similarity comparison with other words finds that "朝阳" (Chaoyang), "朝阳群众" (Chaoyang residents) and "东城区" (Dongcheng District) are words with higher similarity, so these three words can be used to expand the data set, increasing the diversity of the training data without changing the sentence semantics.
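As an illustration only, this word-vector-based augmentation can be sketched as follows; the embed() callable (for example a BERT embedding lookup), the threshold and top_n are assumptions of the sketch, not values defined by the application:

# Minimal sketch: replace a sensitive word with its most similar candidates.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def augment_by_similarity(text, target, candidates, embed, threshold=0.8, top_n=3):
    """Return new texts in which `target` is replaced by its most similar candidate words."""
    target_vec = embed(target)
    scored = [(word, cosine(target_vec, embed(word))) for word in candidates if word != target]
    similar = [w for w, score in sorted(scored, key=lambda x: -x[1]) if score >= threshold]
    return [text.replace(target, w) for w in similar[:top_n]]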
In another specific implementation manner, before training the preset model by using the data set, the method may further include:
and replacing a random named entity in the random labeling text by using other words with the same properties for any labeling text in the data set, and adding the new labeling text obtained after replacement into the data set.
This particular implementation is a data enhancement process based on random word replacement. Among the named entities in a labeling text, one entity is selected at random, a word is randomly selected from the vocabulary to replace it, the replaced labeling text is generated, and the replaced labeling text is added to the data set so as to expand it. In addition, in order to keep the semantics and logic of the replaced labeling text reasonable, a word with the same property as the selected named entity can be chosen for the replacement: if the selected entity is a number, the replacement word is also a number; if it is a food name, the replacement word is also a food name; and so on.
A replacement probability can also be set, for example a 50% probability of performing the replacement operation. For example, for the sentence "Zhang San bought an apple", the word segmentation result is "Zhang San", "bought", "one", "apple"; a word is randomly selected for replacement, for example "one" is selected and replaced by "two" or "three", so that the data set is expanded and the performance and robustness of the model are improved. Moreover, so that the replaced text remains semantically and logically reasonable, i.e. the replaced word keeps a certain contextual consistency and the text does not become disordered or semantically confused, numbers are mostly selected for replacement.
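As an illustration only, the random-replacement augmentation can be sketched as follows; the same_type_vocab mapping, which supplies replacement words of the same property, is an assumption of the sketch:

# Minimal sketch: with probability `prob`, swap one token for a word of the same type.
import random

def augment_by_random_replace(tokens, same_type_vocab, prob=0.5):
    """Randomly replace one token of a segmented sentence with a same-property word."""
    if random.random() >= prob:
        return list(tokens)
    candidates = [i for i, tok in enumerate(tokens) if same_type_vocab.get(tok)]
    if not candidates:
        return list(tokens)
    i = random.choice(candidates)
    out = list(tokens)
    out[i] = random.choice(same_type_vocab[tokens[i]])
    return out

# Example: ["Zhang San", "bought", "one", "apple"] with {"one": ["two", "three"]}
# may become ["Zhang San", "bought", "two", "apple"].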
The method for extracting sensitive information provided by the embodiment of the application may further include, before inputting the text to be extracted without marked sensitive information into the information extraction model:
acquiring a target file and judging whether the target file is a document in a preset format;
if yes, reading the target file according to the line, and taking the text read each time as the text to be extracted respectively; and if not, filtering the target file.
After determining the sensitive information in the text to be extracted based on the extraction result, the method may further include:
and counting the quantity, the number of rows and the proportion of the sensitive information in the target file, and outputting the counting result in a structuring way.
Before the information extraction model is used to extract sensitive information, a target file may first be acquired, the target file being a file that is not marked with sensitive information and from which sensitive information needs to be extracted; whether the target file is a document in the preset format is then judged. If it is, the target file is read line by line and each text that is read is taken as a text to be extracted; otherwise, the target file is filtered out. Specifically, the target file may be read line by line with python-docx, and the read texts are fed to the information extraction model in batches for prediction, so that the sensitive information contained in the currently input texts is predicted and the corresponding extraction results are output. This ensures that the text input into the information extraction model meets the corresponding format requirement, and thus that information extraction proceeds smoothly.
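As an illustration only, reading a .docx target file line by line can be sketched as follows, assuming the python-docx package; predict() stands in for the trained extraction model and is not an API defined by the application:

# Minimal sketch: yield the non-empty paragraphs of a .docx target file.
from docx import Document

def iter_texts(path):
    if not path.lower().endswith(".docx"):   # filter files that are not in the preset format
        return
    for paragraph in Document(path).paragraphs:
        text = paragraph.text.strip()
        if text:
            yield text

# texts = list(iter_texts("target.docx"))
# results = [predict(text) for text in texts]   # feed the read lines to the model in batches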
In addition, after the extraction result output by the information extraction model is obtained, the quantity of sensitive information in the target file, the lines on which the sensitive information is located and the proportion of the sensitive information in the target file can be counted, so that the sensitive information is processed into structured data and output, which makes it convenient to obtain the situation of the sensitive information in the target file. For example, the structured output may take the following JSON form:
{"checkResult":[{
"fileName":"XXX",
"SensitiveWord":"A",
"wordCount":20,
"wordWeight":0.0003,
"wordLines":[3,5,6]
}]}
wherein checkResult represents the extraction result, fileName represents the file name of the target file, SensitiveWord represents a piece of sensitive information in the target file, wordCount represents the counted number of occurrences of the sensitive information, wordWeight represents the proportion of the sensitive information in the whole text of the target file, and wordLines represents the line numbers on which the sensitive information appears in the text of the target file.
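As an illustration only, the statistics above can be assembled roughly as follows; the helper is assumed, and the wordWeight computation (share of characters) is one possible interpretation of the proportion described above:

# Minimal sketch: build the structured checkResult statistics for one target file.
import json

def build_check_result(file_name, lines, found):
    """`found` maps each sensitive word to the 1-based line numbers it occurs on."""
    total_chars = sum(len(line) for line in lines) or 1
    records = []
    for word, line_nos in found.items():
        count = sum(line.count(word) for line in lines)
        records.append({
            "fileName": file_name,
            "SensitiveWord": word,
            "wordCount": count,
            "wordWeight": round(count * len(word) / total_chars, 4),
            "wordLines": sorted(set(line_nos)),
        })
    return json.dumps({"checkResult": records}, ensure_ascii=False, indent=2)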
In the method for extracting sensitive information provided by the embodiment of the application, the preset model may specifically be a BERT+BiLSTM+CRF model.
In the embodiment of the application, the information extraction model adopts a BERT+BiLSTM+CRF model, which mainly comprises three modules. Firstly, the input text passes through the BERT pre-trained language model to obtain corresponding word vector representations; these word vectors contain context information and can capture the semantics of the words and the context of the sentence. Then, the word vectors obtained by BERT are used as input and further processed by the BiLSTM, which models the input word vectors with a forward LSTM network and a backward LSTM network and generates a hidden-state sequence rich in context information, so that the BiLSTM can capture long-distance dependencies in the input word vectors and provide a more accurate feature representation. Finally, the CRF decodes the output of the BiLSTM and generates the predicted labeling sequence, i.e. the extraction result; this labeling sequence contains a BIO label for each word, indicating whether the word belongs to a sensitive information entity. The CRF module considers the dependencies between labels and models the label sequence by defining global label transition probabilities, so as to better capture the context associations between labels. The whole process of sensitive information entity recognition is thereby completed. The BERT+BiLSTM+CRF model structure is shown in FIG. 2.
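As an illustration only, a BERT+BiLSTM+CRF network of this kind can be sketched as follows, assuming the Hugging Face transformers package and the third-party pytorch-crf package; the checkpoint name, hidden size and tag count are illustrative choices, not values fixed by the application:

# Minimal sketch of a BERT + BiLSTM + CRF tagger for B/I/O labels.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF   # from the third-party `pytorch-crf` package

class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags=3, bert_name="bert-base-chinese", lstm_hidden=128):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)        # contextual word vectors
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)  # emission scores for B/I/O
        self.crf = CRF(num_tags, batch_first=True)              # label-transition modelling

    def forward(self, input_ids, attention_mask, tags=None):
        embeddings = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        features, _ = self.bilstm(embeddings)
        emissions = self.classifier(features)
        mask = attention_mask.bool()
        if tags is not None:                                    # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)            # inference: best BIO tag path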
The preprocessing of the BERT data, as shown in fig. 3, may include: adding the [CLS] token at the beginning of the text and the [SEP] token at the end; w1 represents the first word in a single paragraph of the text, w2 represents the second word, and so on. The text is converted into the input sequence and passed through a Multi-Head Attention layer with a residual connection followed by layer normalization, then through a feed-forward neural network, again with a residual connection and layer normalization; finally two vectors are output, output 1 and output 2. The output 1 vector is used to judge whether the text contains sensitive information, and output 2 is fed into the BiLSTM model for named entity recognition, so as to determine the position of the sensitive information.
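As an illustration only, the [CLS]/[SEP] preprocessing can be observed with a BERT tokenizer, assuming the Hugging Face transformers package and the bert-base-chinese checkpoint:

# Minimal sketch: the tokenizer inserts [CLS] at the start and [SEP] at the end.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
ids = tokenizer("我居住在朝阳区")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', '我', '居', '住', '在', '朝', '阳', '区', '[SEP]']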
It should be noted that training the preset model to obtain the information extraction model may include: the input text is encoded by the BERT pre-trained model to generate context-dependent word vectors; the BiLSTM performs forward and backward sequence modelling on the BERT-encoded word vectors to obtain a hidden-state sequence rich in context information; the output of the BiLSTM is taken as the input of the CRF, which classifies the sequence into labels and generates the final named entity label sequence; the model is trained through back propagation and gradient descent, and the model parameters are optimized so that the model achieves good performance on the NER task. The F1-score can be calculated as the model evaluation index, using a multi-classification model (LeNet5) for the evaluation, according to the following formulas:
P = α / B,  R = α / A,  F1 = 2 × P × R / (P + R)
where P represents the precision, R represents the recall, F1 represents the F1-score, α is the number of correctly identified pieces of sensitive information, A is the total number of pieces of sensitive information, and B is the number of pieces of sensitive information identified by the model.
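As an illustration only, the formulas above can be evaluated as follows; the argument names mirror α, A and B:

# Minimal sketch: entity-level precision, recall and F1 as defined above.
def f1_score(num_correct, num_total, num_identified):
    precision = num_correct / num_identified if num_identified else 0.0
    recall = num_correct / num_total if num_total else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# e.g. 8 correct out of 10 identified, 12 in total: P = 0.8, R ≈ 0.667, F1 ≈ 0.727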
Therefore, the tasks of the information extraction model of this embodiment are fused in the same model: the model has one input and one output, finally completes the recognition of the sensitive information entities in each paragraph of the target file, calculates the F1 score, and finally outputs the structured result for each piece of sensitive information.
In a specific implementation manner, a method for extracting sensitive information provided by the embodiment of the application specifically may include the following steps:
1. Data set preparation: sentences or phrases containing sensitive information are prepared for model training and subjected to data enhancement processing, including word-vector-based data enhancement and random-word-replacement-based data enhancement.
2. The information extraction model is trained using the data set after data enhancement processing.
3. The sensitive information contained in each text is predicted using the information extraction model.
4. Structured output of the data related to the sensitive information is performed based on the extraction result.
The application can rapidly and automatically extract the sensitive information in files in batches, count the number of pieces of sensitive information and their proportion in the whole text, greatly shorten the time needed to detect sensitive words, and improve the detection flexibility.
The embodiment of the application also provides a device for extracting the sensitive information, as shown in fig. 4, which specifically may include:
an acquisition module 11 for: acquiring a data set containing a plurality of labeling texts, wherein each labeling text is respectively labeled with sensitive information contained in the labeling text;
training module 12 for: training a preset model by utilizing the data set to obtain an information extraction model for extracting sensitive information;
an extraction module 13 for: inputting the text to be extracted without marked sensitive information into the information extraction model to obtain an extraction result output by the information extraction model, and determining the sensitive information in the text to be extracted based on the extraction result to realize the extraction of the sensitive information.
The device for extracting the sensitive information provided by the embodiment of the application can further comprise:
a first enhancement module for: before training a preset model by utilizing the data set, determining sensitive information marked in a labeling text as specified information; determining other sensitive information with similarity to the specified information reaching a similarity threshold, and replacing the specified information with the other sensitive information to obtain a new labeling text; and adding the obtained new labeling text into the data set.
The first enhancement module may also be configured to: before determining the other sensitive information whose similarity with the specified information reaches the similarity threshold, respectively obtain the word vectors of the specified information and of any other sensitive information, and calculate the similarity between the two word vectors as the similarity between the two corresponding pieces of information.
The device for extracting the sensitive information provided by the embodiment of the application can further comprise:
a second enhancement module for: before training a preset model by using the data set, replacing, for any labeling text in the data set, a random named entity in the text with another word of the same property, and adding the new labeling text obtained after replacement into the data set.
The device for extracting the sensitive information provided by the embodiment of the application can further comprise:
a reading module for: before inputting a text to be extracted without marked sensitive information into the information extraction model, acquiring a target file, and judging whether the target file is a document in a preset format or not; if yes, reading the target file according to the line, and taking the text read each time as the text to be extracted respectively; and if not, filtering the target file.
The device for extracting the sensitive information provided by the embodiment of the application can further comprise:
a statistics module for: and after the sensitive information in the text to be extracted is determined based on the extraction result, counting the quantity, the number of rows and the proportion of the sensitive information in the target file, and structuring and outputting the counting result.
In the device for extracting sensitive information provided by the embodiment of the application, the preset model may be a BERT+BiLSTM+CRF model.
The embodiment of the application also provides a device for extracting sensitive information, which may comprise a memory and a processor, wherein a program stored on the memory, when run by the processor, can implement the steps of the method of extracting sensitive information described above.
Embodiments of the present application also provide a computer-readable storage medium storing a program which, when executed by a processor, implements the steps of the method of extracting sensitive information as described in any of the above.
It should be noted that, for the description of the device, the apparatus and the relevant part in the storage medium for extracting the sensitive information provided in the embodiment of the present application, please refer to the detailed description of the corresponding part in the method for extracting the sensitive information provided in the embodiment of the present application, and the detailed description is omitted herein. In addition, the parts of the above technical solutions provided in the embodiments of the present application, which are consistent with the implementation principles of the corresponding technical solutions in the prior art, are not described in detail, so that redundant descriptions are avoided.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

CN202311108184.9A | 2023-08-30 | 2023-08-30 | Method, device, equipment and storage medium for extracting sensitive information | Withdrawn | CN117131159A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311108184.9A | 2023-08-30 | 2023-08-30 | Method, device, equipment and storage medium for extracting sensitive information

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311108184.9A | 2023-08-30 | 2023-08-30 | Method, device, equipment and storage medium for extracting sensitive information

Publications (1)

Publication Number | Publication Date
CN117131159A | 2023-11-28

Family

ID=88858002

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311108184.9A (Withdrawn; published as CN117131159A) | Method, device, equipment and storage medium for extracting sensitive information | 2023-08-30 | 2023-08-30

Country Status (1)

Country | Link
CN (1) | CN117131159A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN118964950A (en) * | 2024-08-29 | 2024-11-15 | 沥泉科技(成都)有限公司 | A sensitive information extraction method and system based on natural language processing


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110008335A (en) * | 2018-12-12 | 2019-07-12 | 阿里巴巴集团控股有限公司 | The method and device of natural language processing
CN114491018A (en) * | 2021-12-23 | 2022-05-13 | 天翼云科技有限公司 | Construction method of sensitive information detection model, and sensitive information detection method and device
CN116150313A (en) * | 2022-08-26 | 2023-05-23 | 马上消费金融股份有限公司 | Data expansion processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑贤茹, et al.: "Recognition of network sensitive words and variant entities based on BERT-BiLSTM-CRF", Computer and Digital Engineering (《计算机与数字工程》), vol. 51, no. 7, 31 July 2023 (2023-07-31), pages 1585-1589 *


Similar Documents

Publication | Publication Date | Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN113283551B (en) Training method and training device of multi-mode pre-training model and electronic equipment
US10762296B2 (en) Risk address identification method and apparatus, and electronic device
CN108628828B (en) Combined extraction method based on self-attention viewpoint and holder thereof
CN109726274B (en) Question generation method, device and storage medium
Hernault et al. HILDA: A discourse parser using support vector machine classification
CN106095753B (en) A kind of financial field term recognition methods based on comentropy and term confidence level
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN104881458B (en) A kind of mask method and device of Web page subject
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
US10831993B2 (en) Method and apparatus for constructing binary feature dictionary
CN110334186B (en) Data query method and device, computer equipment and computer readable storage medium
CN119988588A (en) A large model-based multimodal document retrieval enhancement generation method
CN113806493A (en) Entity relationship joint extraction method and device for Internet text data
CN109857957B (en) Method for establishing label library, electronic equipment and computer storage medium
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN109684473A (en) A kind of automatic bulletin generation method and system
CN117131159A (en) Method, device, equipment and storage medium for extracting sensitive information
CN114330350B (en) Named entity recognition method and device, electronic equipment and storage medium
CN119810846A (en) Intelligent document review and traceability positioning method based on LLM natural language processing
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN118395987A (en) BERT-based landslide hazard assessment named entity identification method of multi-neural network
CN114332476A (en) Method, device, electronic equipment, storage medium and product for identifying dimensional language
CN113688233A (en) Text understanding method for semantic search of knowledge graph
CN119577139B (en) Event association analysis method and system based on large language model

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WW01 | Invention patent application withdrawn after publication
Application publication date: 2023-11-28

