Sensitive data identification method, system and device
Technical Field
The invention relates to the technical field of information security, in particular to a sensitive data identification method, a sensitive data identification system and a sensitive data identification device.
Background
With the growing importance of data security, protecting enterprise data from leakage has drawn attention across society, and many companies place higher demands on the security of their internal sensitive data.
Unstructured data (including text, pictures, etc.) accounts for over 80% of enterprise data and grows at a rate of 55% to 65% per year. However, the prior art mainly addresses the identification and desensitization of structured data. How to identify and desensitize sensitive data in large-scale, diverse unstructured data is an urgent problem to be solved.
Disclosure of Invention
In view of this, an embodiment of the present invention provides a sensitive data identification method to solve the problem that the prior art has difficulty identifying sensitive data in unstructured data.
The technical solution adopted by the invention to solve the technical problem is as follows.
In a first aspect, a sensitive data identification method is provided, including:
parsing the unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, wherein the sensitive data recognition model comprises a language model based on deep learning, a fully connected layer and a conditional random field (CRF);
and determining the position of the sensitive data in the text data according to the first labeling sequence.
In a second aspect, a sensitive data recognition system is provided, comprising a memory and a processor, wherein the memory is configured to store executable program code; the processor is connected with the memory, and executes a program corresponding to the executable program code by reading the executable program code stored in the memory so as to execute the sensitive data identification method.
In a third aspect, a sensitive data identification apparatus is provided, including:
the parsing unit is used for parsing the unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
the recognition unit is used for inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, wherein the sensitive data recognition model comprises a language model based on deep learning, a fully connected layer and a conditional random field (CRF);
and the determining unit is used for determining the position of the sensitive data in the text data according to the first labeling sequence.
In the embodiment of the invention, the language model based on deep learning learns a better representation of each word in the text data, and the conditional random field (CRF) is used to obtain the labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word in the text data, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
Drawings
FIG. 1 is a flow chart of a sensitive data identification method according to an embodiment of the present invention;
FIG. 2 is a diagram of an SDK embedded application provided by an embodiment of the present invention;
FIG. 3 is a flowchart of sensitive data recognition model training according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a sensitive data recognition model provided by an embodiment of the invention;
FIG. 5 is a flow chart of a process of passing text data through a sensitive data recognition model during a recognition phase according to an embodiment of the present invention;
FIG. 6 is a flowchart of parsing unstructured data according to a third embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a sensitive data identification device according to a fourth embodiment of the present invention;
FIG. 8 is a block diagram of a sensitive data identification system according to a fifth embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the advantageous effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the embodiment of the invention, each word in the text data can be better represented by the language model based on deep learning, and the labeling sequence with the maximum joint distribution probability of the sensitive entity attribute of each word in the text data is obtained by combining CRF, so that the position of the sensitive data in the unstructured data is determined, and the identification accuracy is improved.
Example One
Fig. 1 is a flowchart of a sensitive data identification method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step S101: parsing the unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words.
In the embodiment of the invention, the unstructured data is parsed and its content extracted to obtain the corresponding text data. The unstructured data includes, but is not limited to, WORD, EXCEL, PPT, TXT, PDF and XML files, database text fields, pictures, and the like. The text data includes a plurality of words, that is, it can be segmented down to word granularity (token granularity).
Step S102: inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, wherein the sensitive data recognition model comprises a language model based on deep learning, a fully connected layer and a CRF.
In the embodiment of the invention, the sensitive data recognition model comprises an unsupervised pre-trained bidirectional language model based on deep learning, such as BERT, ELMo, GPT and the like, and the language model converts the text data into word vectors carrying context information. The word vectors pass through the fully connected layer and the CRF in sequence to obtain the probability that each word belongs to each sensitive entity attribute, and then the first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, wherein the first labeling sequence is a sentence-level labeling sequence. Optimization processing such as Viterbi decoding and softmax normalization may also be applied to the output of the CRF.
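As a rough illustration of how a CRF can select the labeling sequence with the maximum joint probability, the following sketch runs Viterbi decoding over per-word emission scores and a label transition matrix. The function name `viterbi_decode`, the reduced three-label set and every score value are assumptions made for illustration only and are not taken from the embodiment.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the label index path with the highest total score.

    emissions[t][j]   : score of label j for word t (fully connected layer output)
    transitions[i][j] : score of moving from label i to label j (CRF parameter)
    Scores are treated as log-domain values, so they are added.
    """
    n_labels = len(transitions)
    # best[j] = best score of any path that ends in label j at the current word
    best = list(emissions[0])
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_best = []
        pointers = []
        for j in range(n_labels):
            # pick the previous label that maximizes the path score into j
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # trace back from the best final label
    path = [max(range(n_labels), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


LABELS = ["O", "B", "E"]  # hypothetical reduced label set
# Hypothetical scores for a three-word sentence.
emissions = [[0.1, 2.0, 0.0],
             [1.2, 0.0, 1.0],
             [2.0, 0.1, 0.0]]
NEG = -1e4  # large negative score that forbids a transition
transitions = [[0.0, 0.0, NEG],   # from O: O->E is invalid
               [NEG, NEG, 0.0],   # from B: only B->E is allowed here
               [0.0, 0.0, NEG]]   # from E: E->E is invalid

print([LABELS[i] for i in viterbi_decode(emissions, transitions)])
```

In this toy input, picking the best label for each word independently would give B, O, O, which leaves the entity unclosed; the transition scores steer the decoder to B, E, O instead.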
Step S103: determining the position of the sensitive data in the text data according to the first labeling sequence.
The position of the sensitive data is determined from the sensitive entity attributes in the first labeling sequence.
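A minimal sketch of this step, assuming BIOES-style labels that carry an attribute suffix such as "B-ORG" or "S-LOC": the function walks the labeling sequence and returns the word positions of each sensitive entity. The suffix format, the attribute names and the function name `extract_spans` are illustrative assumptions; the embodiment does not fix a label format.

```python
from typing import List, Tuple


def extract_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, attribute) for each entity in a BIOES sequence.

    Labels are assumed to look like "B-NAME", "I-NAME", "E-NAME", "S-PHONE" or "O".
    Malformed runs (for example an I- tag with no preceding B-) are skipped.
    """
    spans = []
    start = None     # index where the currently open entity began
    current = None   # attribute of the currently open entity
    for i, label in enumerate(labels):
        prefix, _, attr = label.partition("-")
        if prefix == "S":
            spans.append((i, i + 1, attr))
            start, current = None, None
        elif prefix == "B":
            start, current = i, attr
        elif prefix == "I" and current == attr:
            continue  # the open entity keeps going
        elif prefix == "E" and current == attr:
            spans.append((start, i + 1, attr))
            start, current = None, None
        else:
            # "O" or an inconsistent tag closes any open entity without emitting it
            start, current = None, None
    return spans


labels = ["B-ORG", "I-ORG", "E-ORG", "O", "O", "S-LOC"]
print(extract_spans(labels))  # [(0, 3, 'ORG'), (5, 6, 'LOC')]
```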
Step S104: desensitizing the sensitive data according to the position of the sensitive data in the text data.
In the embodiment of the invention, after the position of the sensitive data is identified, desensitization processing is performed on the identified sensitive data, such as masking, replacement, erasure, format-preserving encryption, symmetric encryption, date generalization, numerical generalization, phrase generalization and the like.
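The simplest of these operations, masking, could be sketched as follows. The function name `mask_spans`, the `keep` parameter and the sample sentence are hypothetical; spans are assumed to be character offsets with an exclusive end.

```python
from typing import List, Tuple


def mask_spans(text: str, spans: List[Tuple[int, int]],
               keep: int = 0, fill: str = "*") -> str:
    """Overwrite each [start, end) character span with `fill`.

    The first `keep` characters of every span are left readable, which is a
    common convention for phone numbers (an assumption, not a requirement).
    """
    chars = list(text)
    for start, end in spans:
        for i in range(start + keep, end):
            chars[i] = fill
    return "".join(chars)


text = "Contact Zhang San on 13800138000."
name_start = text.index("Zhang San")
phone_start = text.index("13800138000")

masked = mask_spans(text, [(name_start, name_start + len("Zhang San"))])
masked = mask_spans(masked, [(phone_start, phone_start + 11)], keep=3)
print(masked)  # Contact ********* on 138********.
```

Because masking preserves the length of the text, the positions found in step S103 remain valid after each replacement; operations that change the length would need to be applied from the last span to the first.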
In the embodiment of the invention, each word in the text data can be better represented by the language model based on deep learning, and the labeling sequence with the maximum joint distribution probability of the sensitive entity attribute of each word in the text data is obtained by combining CRF, so that the position of the sensitive data in the unstructured data is determined, and the identification accuracy is improved.
Preferably, the sensitive data identification and desensitization methods described in steps S101-S104 are integrated into a Software Development Kit (SDK) and exposed through an Application Programming Interface (API), for example in RESTful or gRPC form. The SDK is embedded into an application program; the application program calls the corresponding API in the SDK as required, and the corresponding service returns the result. Fig. 2 is a schematic diagram of an SDK embedded application. The API is as follows:
This approach has a short development cycle and is simple to embed in an application program, helping enterprises integrate the sensitive data identification and desensitization method into their products more easily and thereby improving their data protection capability.
Example Two
As an embodiment of the invention, a sensitive data recognition model needs to be trained before the sensitive data recognition is carried out on the text data. Fig. 3 is a flowchart of sensitive data recognition model training according to a second embodiment of the present invention. As shown in fig. 3, before parsing the unstructured data, the method includes:
Step S301: segmenting the training data into a plurality of words.
The sensitive data recognition model is trained on a large amount of training data. The training data includes a plurality of words, that is, it is segmented down to word granularity (token granularity).
Step S302: labeling the sensitive entity attributes of the training data with preset identifiers to obtain a second labeling sequence.
In the embodiment of the invention, the sensitive entity attributes and their identifiers are defined first. Sensitive entity attributes include, for example, name, age, native place, identity card number, mobile phone number, mailbox, organization name and the like. A labeling scheme such as BIO or BIOES is adopted, and the identifiers include direct identifiers and quasi-identifiers. A direct identifier, such as a name, identity card number or mobile phone number, can locate an individual directly; a single quasi-identifier cannot locate an individual directly, but a combination of multiple quasi-identifiers can. Desensitizing these two major types of identifiers can resist attacks in most cases and greatly reduces the risk of privacy disclosure.
The labeling identifiers are, for example, B, I, E, O and S. The identifier B marks the beginning of a sensitive entity attribute; the identifier I marks its continuation; the identifier E marks its end; the identifier O represents a non-sensitive entity; and the identifier S represents a single-word sensitive entity. In addition, "[CLS]" and "[SEP]" are used at the beginning and end of a sentence, respectively.
For example, the training data "the capital of the People's Republic of China is Beijing" is segmented at word granularity into "China / People's / Republic / capital / is / Beijing". In this sentence, the identifier of "China" is B, the identifier of "People's" is I, the identifier of "Republic" is E, the identifiers of "capital" and "is" are both O, and the identifier of "Beijing" is S, so the labeling sequence of the whole sentence is "[CLS] B I E O O S [SEP]".
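The labeling rule can be sketched in a few lines. The function name `bioes_tags` and the representation of entities as (start, end) word-index pairs with an exclusive end are assumptions made for illustration.

```python
from typing import List, Tuple


def bioes_tags(n_words: int, entities: List[Tuple[int, int]]) -> List[str]:
    """Build a BIOES labeling sequence wrapped in [CLS] and [SEP].

    `entities` holds (start, end_exclusive) word-index pairs of sensitive entities.
    """
    tags = ["O"] * n_words
    for start, end in entities:
        if end - start == 1:
            tags[start] = "S"            # single-word entity
        else:
            tags[start] = "B"            # beginning
            for i in range(start + 1, end - 1):
                tags[i] = "I"            # continuation
            tags[end - 1] = "E"          # end
    return ["[CLS]"] + tags + ["[SEP]"]


words = ["China", "People's", "Republic", "capital", "is", "Beijing"]
# Words 0-2 form one entity and word 5 is a single-word entity.
print(bioes_tags(len(words), [(0, 3), (5, 6)]))
# ['[CLS]', 'B', 'I', 'E', 'O', 'O', 'S', '[SEP]']
```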
The training data is text data for which the second annotation sequence is known.
Step S303: inputting the training data into a sensitive data recognition model to obtain a third labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, wherein the sensitive data recognition model comprises a language model based on deep learning, a fully connected layer and a CRF.
In an embodiment of the invention, the sensitive data recognition model comprises a language model based on deep learning, a fully connected layer and a CRF. Viewed from another angle, as shown in FIG. 4, the sensitive data recognition model includes an input layer, a word representation layer, and a decoding layer.
The input layer preprocesses the input training data to eliminate unreasonable character pairs and invalid characters in the training data.
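One possible form of such preprocessing, covering only the removal of invalid characters, is sketched below. Which characters count as invalid is an assumption here (Unicode control, format, surrogate, private-use and unassigned code points), and the function name `clean_text` is hypothetical.

```python
import unicodedata

# Unicode general categories treated as invalid in this sketch (an assumption).
INVALID_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn"}


def clean_text(text: str) -> str:
    """Drop invalid characters and collapse runs of whitespace to one space."""
    kept = []
    for ch in text:
        if ch in "\t\n\r":
            kept.append(" ")  # keep line structure as plain spaces
        elif unicodedata.category(ch) in INVALID_CATEGORIES:
            continue          # e.g. NUL bytes or zero-width spaces
        else:
            kept.append(ch)
    return " ".join("".join(kept).split())


print(clean_text("Zhang\u200b San\x00\n  13800138000"))  # Zhang San 13800138000
```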
The word representation layer is a bidirectional language model, such as BERT, ELMo, GPT and the like. It uses a Transformer-based encoder to obtain word vectors carrying context information, mines the text features in the training data as fully as possible, and extracts richer word representations. This overcomes the shortcomings of traditional word vectors (such as word2vec, GloVe and the like), which cannot represent context dynamically and cannot resolve polysemy.
The decoding layer comprises a fully connected layer and a CRF. The word vectors output by the word representation layer pass through the fully connected layer to obtain the probability that each word belongs to each sensitive entity attribute (including the "no category" case). The probabilities for each word in the training data are then input into the CRF, which uses the transition probabilities between states and the emission probabilities of the states to obtain the third labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word.
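To make the notion of joint probability concrete, the sketch below scores one labeling sequence with a linear-chain CRF and normalizes it over all possible sequences using the forward algorithm. It is a generic textbook formulation under assumed log-domain scores, not necessarily the exact computation of the embodiment; all function names and numbers are illustrative.

```python
import math
from itertools import product
from typing import List, Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions, transitions, path: List[int]) -> float:
    """Sum of emission scores plus transition scores along one labeling sequence."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def log_partition(emissions, transitions) -> float:
    """Forward algorithm: log of the summed exp-scores of all labeling sequences."""
    n = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n)]) + emissions[t][j]
            for j in range(n)
        ]
    return logsumexp(alpha)


def path_probability(emissions, transitions, path: List[int]) -> float:
    """Joint probability of a whole labeling sequence given the sentence."""
    return math.exp(path_score(emissions, transitions, path)
                    - log_partition(emissions, transitions))


# Hypothetical scores: three words, two labels.
emissions = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]
transitions = [[0.5, -0.2], [-0.4, 0.6]]

all_paths = [list(p) for p in product(range(2), repeat=3)]
best = max(all_paths, key=lambda p: path_probability(emissions, transitions, p))
total = sum(path_probability(emissions, transitions, p) for p in all_paths)
print(best, round(total, 6))
```

Enumerating every path is only feasible for toy inputs; in practice the most probable sequence is found with Viterbi decoding, while the forward algorithm supplies the normalizer needed for the training loss.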
Step S304: comparing the third labeling sequence with the second labeling sequence, and stopping training to obtain the trained sensitive data recognition model when the accuracy is greater than a preset threshold.
The second labeling sequence is a known, manually labeled sequence, and the third labeling sequence is the prediction sequence that the sensitive data recognition model computes for the training data. In the training stage, the third labeling sequence is compared with the second labeling sequence; when the accuracy of the third labeling sequence is greater than a preset threshold, for example greater than 95%, the prediction of the sensitive data recognition model is considered accurate and training stops. The sensitive data recognition model can then be used for sensitive data identification.
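The stopping rule might look like the following sketch, which assumes that accuracy is measured per word over all sentences. The embodiment does not specify the metric, so `token_accuracy` and `should_stop` are hypothetical helpers.

```python
from typing import List


def token_accuracy(predicted: List[List[str]], reference: List[List[str]]) -> float:
    """Fraction of words whose predicted label equals the manual label."""
    correct = 0
    total = 0
    for pred_seq, ref_seq in zip(predicted, reference):
        if len(pred_seq) != len(ref_seq):
            raise ValueError("predicted and reference sequences differ in length")
        correct += sum(p == r for p, r in zip(pred_seq, ref_seq))
        total += len(ref_seq)
    return correct / total if total else 0.0


def should_stop(predicted: List[List[str]], reference: List[List[str]],
                threshold: float = 0.95) -> bool:
    """Stop training once accuracy exceeds the preset threshold."""
    return token_accuracy(predicted, reference) > threshold


second = [["B", "I", "E", "O", "O", "S"]]   # manual labels
third = [["B", "I", "E", "O", "O", "O"]]    # model prediction, one word wrong
print(token_accuracy(third, second), should_stop(third, second))
```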
Correspondingly, in the recognition stage, as shown in fig. 5, inputting the text data into the sensitive data recognition model to obtain the first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, wherein the sensitive data recognition model comprises a language model based on deep learning, a fully connected layer and a CRF, includes:
Step S501: inputting the text data into the language model based on deep learning to obtain a word vector with context information corresponding to each word.
Step S502: inputting the word vectors into the fully connected layer and the CRF to obtain the first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word.
In the recognition stage, the text data is processed by the sensitive data recognition model in the same way as in the training stage, and the resulting first labeling sequence combines contextual features with the dependencies between labels.
In the embodiment of the invention, the language model based on deep learning dynamically represents each word in the text data and learns the semantic information of the text data. This overcomes the rigid patterns, insufficient accuracy and poor cross-platform recognition capability of pattern-matching methods, which usually also require expert guidance and considerable manual effort to write rules. Meanwhile, the CRF is used to obtain the labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word in the text data, so that the position of the sensitive data in the unstructured data is determined. This addresses the entity boundary errors that arise when sensitive entities are long or when one sentence contains multiple entities, and improves the identification accuracy.
Example Three
As an embodiment of the present invention, the unstructured data is a picture. Fig. 6 is a flowchart of parsing unstructured data according to a third embodiment of the present invention. As shown in fig. 6, the method includes:
Step S601: determining a text area in the picture to be desensitized through a first neural network.
In the embodiment of the invention, a picture is divided to generate a plurality of sub-text suggestion boxes; simultaneously, extracting the characteristics of the picture to be desensitized by utilizing a convolutional neural network; and inputting the characteristics of the picture and the sub text suggestion boxes into a recurrent neural network to analyze the characteristics of the picture, obtaining the score of each sub text suggestion box containing the text data, thereby determining the sub text suggestion boxes possibly containing the text data, and connecting the sub text suggestion boxes to form a text area containing the text data.
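The final step, connecting high-scoring sub-text suggestion boxes into text areas, can be approximated by a simple geometric merge. The sketch below is a deliberate simplification under assumed thresholds: it ignores the neural scoring itself, and the function name `connect_proposals` and all parameter values are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def connect_proposals(proposals: List[Tuple[Box, float]],
                      min_score: float = 0.7,
                      max_gap: float = 10.0,
                      min_overlap: float = 0.7) -> List[Box]:
    """Merge horizontally adjacent, vertically overlapping boxes into text areas."""
    kept = sorted((box for box, score in proposals if score >= min_score),
                  key=lambda b: b[0])
    areas: List[List[float]] = []
    for x1, y1, x2, y2 in kept:
        for area in areas:
            overlap = min(area[3], y2) - max(area[1], y1)
            shorter = min(area[3] - area[1], y2 - y1)
            close = x1 - area[2] <= max_gap
            if close and shorter > 0 and overlap / shorter >= min_overlap:
                # extend the existing text area to cover this box
                area[1] = min(area[1], y1)
                area[2] = max(area[2], x2)
                area[3] = max(area[3], y2)
                break
        else:
            areas.append([x1, y1, x2, y2])
    return [tuple(a) for a in areas]


proposals = [((0, 0, 16, 20), 0.9), ((16, 1, 32, 21), 0.95), ((32, 0, 48, 20), 0.8),
             ((200, 100, 216, 120), 0.9), ((60, 0, 76, 20), 0.2)]
print(connect_proposals(proposals))  # [(0, 0, 48, 21), (200, 100, 216, 120)]
```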
Step S602: obtaining the text data in the text area through a second neural network.
In the embodiment of the invention, the part of the picture within the text area is first input into a convolutional neural network for text recognition, which converts the picture pixels into feature vectors. The feature vectors are then analyzed by a recurrent neural network for text recognition to obtain a character sequence, namely the text data; in this way the characters in the text area are converted into text data that a computer can process. The output of the recurrent neural network may contain repeated characters or blanks, so a character sequence transcription step based on connectionist temporal classification (CTC) is preferably added after the recurrent neural network to obtain the final text data.
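The collapsing rule used in greedy CTC decoding, merging consecutive repeats and then dropping blanks, fits in one line. The blank symbol "-" and the function name `ctc_collapse` are assumptions; the input stands for the most likely symbol at each time step of the recurrent network.

```python
from itertools import groupby
from typing import Iterable


def ctc_collapse(frame_labels: Iterable[str], blank: str = "-") -> str:
    """Merge consecutive duplicate symbols, then remove the blank symbol."""
    return "".join(symbol for symbol, _ in groupby(frame_labels) if symbol != blank)


# A blank between the two runs of "l" keeps the double letter in the output.
print(ctc_collapse("hh-e-ll-lo--"))  # hello
```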
The text data is then input into the sensitive data recognition model described in the first and second embodiments to identify the sensitive data; combined with the text area obtained in step S601, the sensitive data is located and desensitized on the picture.
In the embodiment of the invention, a convolutional neural network and a recurrent neural network are used to recognize the characters in the picture to be desensitized, converting the pixels of the picture into text data that a computer can process. The language model based on deep learning then learns a better representation of each word in the text data, and the CRF is used to obtain the labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word in the text data, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
Example Four
Fig. 7 is a schematic structural diagram of a sensitive data identification device according to a fourth embodiment of the present invention. As shown in fig. 7, the sensitive data identification device includes: a parsing unit 71, a recognition unit 72, and a determining unit 73.
The parsing unit 71 is configured to parse the unstructured data to obtain text data corresponding to the unstructured data, where the text data includes a plurality of words.
The recognition unit 72 is configured to input the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, where the sensitive data recognition model includes a language model based on deep learning, a fully connected layer, and a CRF.
The determining unit 73 is configured to determine the position of the sensitive data in the text data according to the first labeling sequence.
As an embodiment of the present invention, the sensitive data identification apparatus further includes:
The segmentation unit is used for segmenting the training data into a plurality of words.
The labeling unit is used for labeling the sensitive entity attributes of the training data with preset identifiers to obtain a second labeling sequence.
The training unit is used for inputting the training data into a sensitive data recognition model to obtain a third labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, wherein the sensitive data recognition model comprises a language model based on deep learning, a fully connected layer and a CRF.
The judging unit is used for comparing the third labeling sequence with the second labeling sequence, and stopping training to obtain the trained sensitive data recognition model when the accuracy is greater than a preset threshold.
Accordingly, the recognition unit 72 includes:
The word representation subunit is used for inputting the text data into the language model based on deep learning to obtain a word vector with context information corresponding to each word.
The decoding subunit is used for inputting the word vectors into the fully connected layer and the CRF to obtain the first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word.
Preferably, the preset identifier includes a direct identifier and a quasi-identifier.
As an embodiment of the present invention, the sensitive data identification apparatus further includes a desensitization unit, configured to perform desensitization processing on the sensitive data according to a position of the sensitive data in the text data.
Preferably, the parsing unit 71 includes:
The first subunit is used for determining a text area in the picture to be desensitized through the first neural network.
The second subunit is used for obtaining the text data in the text area through the second neural network.
In the embodiment of the invention, the language model based on deep learning learns a better representation of each word in the text data, and the CRF is used to obtain the labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word in the text data, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
Example Five
Fig. 8 is a block diagram of a sensitive data identification system according to a fifth embodiment of the present invention. As shown in fig. 8, the system includes a memory 81 and a processor 82, wherein the memory 81 is used for storing executable program code; the processor 82 is connected to the memory 81 and, by reading the executable program code stored in the memory 81, runs the corresponding program so as to perform the following steps:
parsing the unstructured data to obtain text data corresponding to the unstructured data, wherein the text data comprises a plurality of words;
inputting the text data into a sensitive data recognition model to obtain a first labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word, wherein the sensitive data recognition model comprises a language model based on deep learning, a fully connected layer and a conditional random field (CRF);
and determining the position of the sensitive data in the text data according to the first labeling sequence.
In the embodiment of the invention, the language model based on deep learning learns a better representation of each word in the text data, and the CRF is used to obtain the labeling sequence with the maximum joint distribution probability of the sensitive entity attribute for each word in the text data, so that the position of the sensitive data in the unstructured data is determined and the identification accuracy is improved.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings; they are not to be construed as limiting the scope of the invention. Those skilled in the art can implement the invention with various modifications without departing from its scope and spirit; for example, features of one embodiment can be used in another embodiment to yield a further embodiment. Any modification, equivalent replacement or improvement made within the technical idea of the present invention shall fall within the scope of the claims of the present invention.