Disclosure of Invention
One or more embodiments of the present specification describe a method and apparatus for identifying private data in a database, which can improve the efficiency of identifying private data in the database.
In a first aspect, there is provided a method of identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the method comprising:
forming a queue from the fields in each data table included in the database;
performing, in the order of the fields in the queue, a processing operation on each current first field in turn, the processing operation comprising:
in a case where the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and taking the first identification result as the identification result label of the first field;
if the first identification result indicates that the first field belongs to private data, searching for a second field having a preset relationship with the first field; and
identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data to obtain a second identification result, and taking the second identification result as the identification result label of the second field.
In a possible implementation, the forming a queue from the fields in each data table included in the database includes:
parsing the field names of the fields from a metadata table in the database, and sorting the field names to form the queue.
Further, the identifying whether the first field belongs to private data includes:
acquiring sample data corresponding to the field name of the first field from the database;
and inputting the sample data into a private data identification model to obtain the first identification result.
Further, the private data identification model includes at least one of the following types of identification logic:
regular expressions, language models, check rules, and multi-classification models.
In a possible embodiment, the searching for the second field having the preset relationship with the first field includes:
searching for the second field having the preset relationship with the first field in a pre-established data relationship graph, where the data relationship graph includes nodes corresponding to the fields, and the connecting edges between the nodes correspond to the relationships between the fields.
Further, the data relationship graph further includes nodes corresponding to the data tables, and the connecting edges between the nodes further correspond to the relationships between data tables and fields and the relationships between data tables.
Further, the data relationship graph is obtained by parsing Structured Query Language (SQL) statements corresponding to the database.
Further, the searching for the second field having the preset relationship with the first field in the pre-established data relationship graph includes:
starting from the node corresponding to the first field, searching for nodes whose connecting edges correspond to the preset relationship, until the relationship of a connecting edge is not the preset relationship, and taking the fields corresponding to the nodes found as the second field.
In one possible embodiment, the preset relationship is copy;
the identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data includes:
directly determining that the second identification result is that the second field belongs to private data.
In one possible embodiment, the preset relationship is truncation;
the identifying whether the first field belongs to private data comprises:
respectively identifying whether the first field belongs to the private data or not by utilizing each identification model in a first identification model set to obtain each first identification sub-result, and comprehensively determining the first identification result according to each first identification sub-result;
the identifying whether the second field belongs to private data by using the mode corresponding to the preset relationship comprises:
identifying whether the second field belongs to private data by using at least one identification model in a second identification model set to obtain the second identification result, where the second identification model set is a subset of the first identification model set.
In a possible embodiment, the first identification result and/or the second identification result includes:
whether the field belongs to private data, and the type of private data when it belongs to private data.
In a second aspect, there is provided an apparatus for identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus comprising:
a queue forming unit, configured to form a queue for each field in each data table included in the database;
a first identifying unit, configured to perform processing operations on a current first field in sequence according to the sequence of each field in the queue obtained by the queue forming unit, where the processing operations include:
under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field;
a searching unit, configured to search for a second field having a preset relationship with the first field if the first identification result obtained by the first identifying unit indicates that the first field belongs to private data; and
a second identifying unit, configured to identify, in a manner corresponding to the preset relationship, whether the second field found by the searching unit belongs to private data to obtain a second identification result, and take the second identification result as the identification result label of the second field.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and the device provided by the embodiment of the specification, firstly, each field in each data table included in the database forms a queue; then, according to the sequence of each field in the queue, processing operation is sequentially performed on the current first field, and the processing operation includes: under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field; then if the first identification result indicates that the first field belongs to private data, searching a second field having a preset relationship with the first field; and finally, identifying whether the second field belongs to the private data or not by using a mode corresponding to the preset relation to obtain a second identification result, and taking the second identification result as an identification result label of the second field. As can be seen from the above, in the embodiments of the present specification, by using the relationship between fields, in the process of sequentially identifying whether each field belongs to private data, once a first field belonging to private data is encountered, a second field having a preset relationship with the first field is immediately queried, and whether the second field belongs to private data is identified by using a manner corresponding to the preset relationship.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, wherein a field corresponds to a column. Referring to fig. 1, the database includes n data tables, which are respectively denoted as table1, table2, …, and table n, where table1 includes i columns, table2 includes j columns, …, and table n includes k columns, and the database further includes a metadata table in which information of each data table is recorded, for example, information such as a field name and a storage location corresponding to each column in the data table.
Generally, when identifying private data in a database, the metadata table is read first and the information of each data table is obtained from it. Then, based on this information, the data of one table at a time is fetched through the underlying database interface, and private data identification is performed on each column of data in the table to determine whether the column belongs to private data. Because there are usually dozens of types of private data, and some private data identification may rely on deep learning models with complex computation, it is difficult to identify the whole database within an acceptable time when the data volume of the database is very large.
According to the embodiment of the specification, the private data is identified by using the relation between the fields, so that the calculation amount can be effectively reduced, and the efficiency of identifying the private data in the database can be greatly improved.
The relationship between fields may be a data lineage relationship. Data lineage describes the upstream-downstream relationship between data, generally includes copy, truncation, concatenation, conversion, and the like, and indicates that the data of one field is obtained by processing the data of another field.
Fig. 2 shows a flow diagram of a method of identifying private data in a database according to an embodiment, where the database comprises a plurality of data tables, each data table comprises a plurality of fields, and the method may be based on the implementation scenario shown in fig. 1. As shown in fig. 2, the method for identifying private data in a database in this embodiment includes the following steps:
step 21, forming a queue from the fields in each data table included in the database;
step 22, performing, in the order of the fields in the queue, a processing operation on each current first field in turn, where the processing operation includes: in a case where the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and taking the first identification result as the identification result label of the first field;
step 23, if the first identification result indicates that the first field belongs to private data, searching for a second field having a preset relationship with the first field;
step 24, identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data to obtain a second identification result, and taking the second identification result as the identification result label of the second field.
Specific execution modes of the above steps are described below.
First, in step 21, the fields in each data table included in the database are formed into a queue. It can be understood that the fields in the queue have a definite order; the fields of all data tables may be arranged in arbitrary order, or the fields of the same data table may be placed at adjacent positions.
In one example, the forming a queue from the fields in each data table included in the database includes:
parsing the field names of the fields from a metadata table in the database, and sorting the field names to form the queue.
It is understood that a field name can uniquely identify a field; for example, a globally unique identifier (GUID) may be used as the field name, specifically in the form of project_name.
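As a rough illustration of step 21, the sketch below builds a queue of field names from metadata rows. The row layout, the `table.column` naming form, and the names `METADATA_ROWS` and `build_field_queue` are assumptions made for this example and are not taken from the specification.

```python
from collections import deque

# Hypothetical metadata rows: (table name, column name). A real metadata
# table would be read through the database's own interface.
METADATA_ROWS = [
    ("table2", "mobile_no"),
    ("table1", "identity_no"),
    ("table1", "mobile_no"),
]


def build_field_queue(metadata_rows):
    """Return a FIFO queue of unique field names in sorted order."""
    # Assumed naming form "table.column"; it only needs to be unique per field.
    names = {f"{table}.{column}" for table, column in metadata_rows}
    # Sorting places the fields of the same table at adjacent positions.
    return deque(sorted(names))


queue = build_field_queue(METADATA_ROWS)
print(list(queue))
```

Sorting is only one possible ordering; as noted above, an arbitrary order would also satisfy the description.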
Then, in step 22, a processing operation is performed on each current first field in turn, in the order of the fields in the queue, where the processing operation includes: in a case where the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and taking the first identification result as the identification result label of the first field. It will be appreciated that each field needs to be identified only once, and a field that already has an identification result label does not need to be identified again.
In one example, the identifying whether the first field belongs to private data includes:
acquiring sample data corresponding to the field name of the first field from the database;
and inputting the sample data into a private data identification model to obtain the first identification result.
It is understood that the sample data is the column of data corresponding to the field name of the first field, or part of the data in that column. The columns of a data table are shown in table one.
Table one: data table
| | Column 1 | Column 2 | Column 3 |
| Row 1 | Xiao Hong | Female | 15 years old |
| Row 2 | Xiao Ming | Male | 16 years old |
| Row 3 | Xiao Gang | Male | 17 years old |
| Row 4 | Xiao Lan | Female | 14 years old |
Referring to table one, which is a data table with 4 rows and 3 columns, if the field name of the first field corresponds to column 1, all data in column 1 may be used as the sample data, in which case the sample data includes Xiao Hong, Xiao Ming, Xiao Gang, and Xiao Lan; alternatively, part of the data in column 1 may be used as the sample data, for example, the sample data includes only Xiao Hong.
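The sampling described above can be sketched with an in-memory SQLite database. The table name `people`, its column names, and the helper `fetch_sample` are hypothetical; the specification does not prescribe a particular database interface.

```python
import sqlite3


def fetch_sample(conn, table, column, limit=None):
    """Return the values of one column; `limit` caps the sample size if given."""
    # Identifiers cannot be bound as parameters, so they are quoted here;
    # the names are assumed to come from trusted metadata.
    sql = f'SELECT "{column}" FROM "{table}"'
    params = ()
    if limit is not None:
        sql += " LIMIT ?"
        params = (limit,)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, gender TEXT, age TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        ("Xiao Hong", "Female", "15 years old"),
        ("Xiao Ming", "Male", "16 years old"),
        ("Xiao Gang", "Male", "17 years old"),
        ("Xiao Lan", "Female", "14 years old"),
    ],
)

full_sample = fetch_sample(conn, "people", "name")              # whole column
partial_sample = fetch_sample(conn, "people", "name", limit=1)  # part of it
print(full_sample, partial_sample)
```

Taking only part of the column trades some identification accuracy for less data transfer; how large the sample should be is left open here.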
Further, the private data identification model includes at least one of the following types of identification logic:
regular expressions, language models, check rules, and multi-classification models.
It is understood that a regular expression constructs, according to an agreed syntax, a single character string that describes and matches a series of character strings conforming to a certain syntactic rule.
A language model is a mathematical model that uses a probability distribution to describe the probability of a certain word string or character string.
There may be multiple check rules; for example, private data may be identified by determining whether all the sample data consist of digits and whether the number of digits equals a predetermined number.
The multi-classification model can be obtained through machine learning and may be, for example, a neural network model or a deep learning model.
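Two of the identification logics listed above, a regular expression and a check rule, can be sketched as follows. The phone-number pattern, the 80% match threshold, and the 11-digit length are illustrative assumptions, not values given in the specification.

```python
import re

# Hypothetical pattern: an 11-digit number starting with 1.
PHONE_PATTERN = re.compile(r"1\d{10}")


def matches_regex(samples, pattern=PHONE_PATTERN, threshold=0.8):
    """Regex logic: the field matches if enough samples fit the pattern."""
    if not samples:
        return False
    hits = sum(1 for s in samples if pattern.fullmatch(s))
    return hits / len(samples) >= threshold


def passes_check_rule(samples, digits=11):
    """Check rule: every sample is all digits with a fixed number of digits."""
    return bool(samples) and all(s.isdigit() and len(s) == digits for s in samples)


samples = ["13800000000", "13900000001"]
print(matches_regex(samples), passes_check_rule(samples))
```

A language model or multi-classification model would plug in behind the same kind of function signature, but is too large to sketch here.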
Then, in step 23, if the first identification result indicates that the first field belongs to private data, a second field having a preset relationship with the first field is searched for. It will be appreciated that if the first field belongs to private data, a second field having the preset relationship with the first field either necessarily belongs to private data or is more likely to belong to private data.
In one example, the searching for a second field having a preset relationship with the first field includes:
searching for the second field having the preset relationship with the first field in a pre-established data relationship graph, where the data relationship graph includes nodes corresponding to the fields, and the connecting edges between the nodes correspond to the relationships between the fields.
In the embodiments of this specification, the data relationship graph may be stored in a graph database. A graph database is a non-relational database that uses graph theory to store relationship information between entities. Compared with a relational database, a graph database can be queried conveniently and quickly and can support various kinds of computation and reasoning.
Further, the data relationship graph further includes nodes corresponding to the data tables, and the connecting edges between the nodes further correspond to the relationships between data tables and fields and the relationships between data tables.
Further, the data relationship graph is obtained by parsing Structured Query Language (SQL) statements corresponding to the database.
SQL parsing is the cornerstone of constructing data lineage. It mainly parses the lineage relationships between fields and tables, between fields, and between tables, as described in the SQL. Generally, the relationship between fields may include copy, truncation (substr), concatenation (concat), and the like; the relationship between tables is dependency (depended); and the relationship between a field and a table is belonging (belong). A parsed lineage relationship may be represented by a triple (source_node, target_node, relation), where source_node is the identifier of the source node, target_node is the identifier of the target node, and relation is the relationship between the nodes. For example, consider the following SQL:
CREATE TABLE Table1 AS
SELECT identity_no, mobile_no
FROM Table2;
The lineage relationships obtained by SQL parsing include:
(Table2.identity_no, Table1.identity_no, copy), representing that the identity number field in Table1 is a copy of the identity number field in Table2;
(Table2.mobile_no, Table1.mobile_no, copy), representing that the mobile phone number field in Table1 is a copy of the mobile phone number field in Table2;
(Table2, Table1, depended), representing that Table1 depends on Table2;
(Table1.identity_no, Table1, belong), representing that the identity number field in Table1 belongs to Table1;
(Table1.mobile_no, Table1, belong), representing that the mobile phone number field in Table1 belongs to Table1;
(Table2.mobile_no, Table2, belong), representing that the mobile phone number field in Table2 belongs to Table2;
(Table2.identity_no, Table2, belong), representing that the identity number field in Table2 belongs to Table2.
Mature third-party libraries for SQL parsing are currently available, so the parsing principle is not described further here.
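The triples listed above can be reproduced with a small helper. This sketch does not parse SQL; it assumes the statement has already been parsed into a target table, a source table, and a column list, and the function name `lineage_triples` is hypothetical.

```python
def lineage_triples(target_table, source_table, columns):
    """Build (source_node, target_node, relation) triples for a statement of
    the form CREATE TABLE target AS SELECT columns FROM source."""
    triples = []
    # Field-to-field edges: each selected column is copied unchanged.
    for col in columns:
        triples.append((f"{source_table}.{col}", f"{target_table}.{col}", "copy"))
    # Table-to-table edge: the target table depends on the source table.
    triples.append((source_table, target_table, "depended"))
    # Field-to-table edges: each field belongs to its own table.
    for table in (target_table, source_table):
        for col in columns:
            triples.append((f"{table}.{col}", table, "belong"))
    return triples


triples = lineage_triples("Table1", "Table2", ["identity_no", "mobile_no"])
for triple in triples:
    print(triple)
```

Expressions such as substr or concat in the select list would produce other relation names; handling them requires a real SQL parser and is outside this sketch.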
Further, the searching for the second field having the preset relationship with the first field in the pre-established data relationship graph includes:
starting from the node corresponding to the first field, searching for nodes whose connecting edges correspond to the preset relationship, until the relationship of a connecting edge is not the preset relationship, and taking the fields corresponding to the nodes found as the second field.
In the embodiments of this specification, the second field may be searched for by depth-first search. Starting from a vertex v in the graph, depth-first search first visits v, and then performs a depth-first traversal of the graph from each unvisited neighbor of v in turn, until every vertex that has a path to v has been visited. If some vertices in the graph are still unvisited, the depth-first traversal starts again from an unvisited vertex, until all vertices in the graph have been visited. It will be appreciated that the vertex v corresponds to the first field.
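A minimal sketch of this search is given below, assuming the graph is held in memory as an adjacency list built from (source_node, target_node, relation) triples. Treating edges as traversable in both directions (upstream and downstream) is an assumption of this example, as are the names `build_graph` and `find_related_fields`.

```python
from collections import defaultdict


def build_graph(triples):
    """Adjacency list: node -> list of (neighbor, relation)."""
    graph = defaultdict(list)
    for source, target, relation in triples:
        graph[source].append((target, relation))
        graph[target].append((source, relation))  # allow upstream lookups too
    return graph


def find_related_fields(graph, start, relation="copy"):
    """Depth-first search from `start` that only follows `relation` edges."""
    visited = {start}
    found = []

    def visit(node):
        for neighbor, rel in graph.get(node, []):
            # Stop along this branch as soon as the edge is not the preset relation.
            if rel == relation and neighbor not in visited:
                visited.add(neighbor)
                found.append(neighbor)
                visit(neighbor)

    visit(start)
    return found


graph = build_graph([
    ("a.x", "b.x", "copy"),
    ("b.x", "c.x", "copy"),
    ("c.x", "d.x", "substr"),
    ("a.x", "a", "belong"),
])
print(find_related_fields(graph, "a.x"))
```

Because only fields reachable through the preset relationship are of interest, the search never restarts from unvisited vertices as a full graph traversal would.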
Finally, in step 24, whether the second field belongs to private data is identified in a manner corresponding to the preset relationship to obtain a second identification result, and the second identification result is taken as the identification result label of the second field. It can be understood that the second field is identified in a different manner from the first field: the preset relationship is taken into account when identifying whether the second field belongs to private data, which can effectively reduce the amount of computation and greatly improve the efficiency of identifying private data.
In one example, the preset relationship is copy;
the identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data includes:
directly determining that the second identification result is that the second field belongs to private data.
It can be understood that, if the second field is a copy of the first field and the first field has already been identified as private data, the second field necessarily belongs to private data, so no private data identification needs to be run on the second field, which improves identification efficiency.
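For the copy case, labeling can be sketched as a plain dictionary update with no model call. The label values and the function name `label_copies` are hypothetical, and keeping an existing label instead of overwriting it is an assumption of this example.

```python
def label_copies(labels, first_field, copied_fields):
    """Give each copied field the label of the first field, without running
    any identification model on it."""
    first_label = labels[first_field]
    for field in copied_fields:
        # Keep a label that already exists (assumption: labels are final).
        labels.setdefault(field, first_label)
    return labels


labels = {"table1.mobile_no": "private:phone"}
label_copies(labels, "table1.mobile_no", ["table2.mobile_no", "table3.mobile_no"])
print(labels)
```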
In one example, the preset relationship is truncation;
the identifying whether the first field belongs to private data comprises:
respectively identifying whether the first field belongs to the private data or not by utilizing each identification model in a first identification model set to obtain each first identification sub-result, and comprehensively determining the first identification result according to each first identification sub-result;
the identifying whether the second field belongs to private data by using the mode corresponding to the preset relationship comprises:
identifying whether the second field belongs to private data by using at least one identification model in a second identification model set to obtain the second identification result, where the second identification model set is a subset of the first identification model set.
It can be understood that, if the first field and the second field are in a truncation relationship, that is, the second field is a substring of the first field, then once the first field has been identified as private data, the second field is more likely to belong to private data. The range of private data identification for the second field can therefore be narrowed, so that the amount of computation is lower than that for identifying the first field, which improves identification efficiency.
In one example, the first identification result and/or the second identification result includes:
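The truncation case can be sketched with two model sets, the second being a subset of the first. The three toy models, the choice of which model goes into the second set, and the rule that any positive sub-result makes the field private are all assumptions of this example; the specification only requires the subset relationship.

```python
import re

# Hypothetical first identification model set: name -> callable(samples) -> bool.
FIRST_MODEL_SET = {
    "id_number_regex": lambda samples: all(re.fullmatch(r"\d{17}[\dX]", s) for s in samples),
    "digits_only": lambda samples: all(s.isdigit() for s in samples),
    "email_regex": lambda samples: all(re.fullmatch(r"[^@\s]+@[^@\s]+", s) for s in samples),
}

# Second set: a subset of the first, so fewer models run on truncated fields.
SECOND_MODEL_SET = {"digits_only": FIRST_MODEL_SET["digits_only"]}


def run_models(models, samples):
    """Run every model in `models`; empty samples never match."""
    return {name: bool(samples) and model(samples) for name, model in models.items()}


def identify_first(samples):
    """Combine all sub-results (assumed rule: any positive sub-result wins)."""
    sub_results = run_models(FIRST_MODEL_SET, samples)
    return any(sub_results.values()), sub_results


def identify_truncated(samples):
    """Identify a truncated field with the smaller second model set only."""
    return any(run_models(SECOND_MODEL_SET, samples).values())


is_private, sub_results = identify_first(["110101199001011234"])
print(is_private, sub_results)
print(identify_truncated(["199001"]))  # a substring of the value above
```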
whether the field belongs to private data, and the type of private data when it belongs to private data.
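One way to hold such a result is a small immutable record. The class name `IdentificationResult`, its field names, and the example type string are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdentificationResult:
    """Identification result label for one field."""
    is_private: bool
    privacy_type: Optional[str] = None  # only meaningful when is_private is True

    def __post_init__(self):
        if not self.is_private and self.privacy_type is not None:
            raise ValueError("privacy_type only applies to private fields")


phone_label = IdentificationResult(is_private=True, privacy_type="mobile phone number")
plain_label = IdentificationResult(is_private=False)
print(phone_label, plain_label)
```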
In the embodiments of this specification, there are many types of private data, which is one reason why private data identification is generally inefficient. Common types of private data are shown in table two.
Table two: common private data types
Referring to table two, because the types of private data are so varied, identification is complex and generally requires a large amount of computation.
According to the method provided by the embodiment of the specification, firstly, each field in each data table included in the database forms a queue; then, according to the sequence of each field in the queue, processing operation is sequentially performed on the current first field, and the processing operation includes: under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field; then if the first identification result indicates that the first field belongs to private data, searching a second field having a preset relationship with the first field; and finally, identifying whether the second field belongs to the private data or not by using a mode corresponding to the preset relation to obtain a second identification result, and taking the second identification result as an identification result label of the second field. As can be seen from the above, in the embodiments of the present specification, by using the relationship between fields, in the process of sequentially identifying whether each field belongs to private data, once a first field belonging to private data is encountered, a second field having a preset relationship with the first field is immediately queried, and whether the second field belongs to private data is identified by using a manner corresponding to the preset relationship.
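Steps 21 to 24 can be put together in a short end-to-end sketch. The callables passed to `scan`, the toy copy table, and the rule that an already labeled second field is left untouched are assumptions made for illustration.

```python
from collections import deque


def scan(field_names, identify, find_related, identify_related):
    """Label every field, running the full identification only when needed."""
    labels = {}
    queue = deque(field_names)                 # step 21: fields form a queue
    while queue:
        first = queue.popleft()                # step 22: take fields in order
        if first in labels:
            continue                           # already labeled, skip
        labels[first] = identify(first)
        if not labels[first]:
            continue
        for second in find_related(first):     # step 23: related fields
            if second not in labels:           # step 24: cheaper identification
                labels[second] = identify_related(second, labels[first])
    return labels


# Toy setup: one copy relationship and a call log for the expensive model.
COPIES = {"table1.mobile_no": ["table2.mobile_no"]}
model_calls = []


def toy_identify(field):
    model_calls.append(field)
    return field.endswith("mobile_no")


labels = scan(
    ["table1.mobile_no", "table1.note", "table2.mobile_no"],
    identify=toy_identify,
    find_related=lambda field: COPIES.get(field, []),
    identify_related=lambda field, first_label: first_label,
)
print(labels)
print(model_calls)
```

In this toy run the expensive `toy_identify` is called for two of the three fields; the copied field receives its label without a model call, which is where the saving comes from.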
FIG. 3 illustrates a system architecture diagram for identifying private data in a database, according to one embodiment. Referring to fig. 3, the metadata parsing module 31 first reads and parses the relevant data from the metadata table in the database and places the parsing results in a queue. The scanner 32, assumed here to use a single thread, reads one element from the queue at a time, fetches the corresponding sample data from the database based on that element, and then performs data identification with the built-in private data identification module. If the identification result is not sensitive data, the scanner continues to read and consume the next element from the queue; if the identification result is sensitive data, the scanner uses a depth-first search algorithm to find, in the data lineage graph, the upstream and downstream elements that have a copy relationship, and then labels the relevant elements as sensitive data in the database.
The database decoupling unit: customer data is stored in different types of databases, such as MySQL and Oracle, and a unified interface hides the differences between these databases.
The built-in private data identification module stores the various logics for identifying sensitive data types, including regular expressions, language models, check rules, multi-classification models, and the like.
The scanning logic unit executes the scanning logic, which is a hybrid of linear scanning and tree scanning: scanning is sequential by default and switches to tree scanning when sensitive data is identified.
The data sampling logic samples data from the database according to the metadata, to provide the data basis for subsequent identification by the scanner.
It is understood that the sensitive data is private data.
FIG. 4 shows a schematic diagram of a fast private data scanning method based on data lineage according to one embodiment. Referring to fig. 4, in the embodiment of this specification, sequential scanning serves as the cold-start entry: each column of each table is scanned in turn from top to bottom according to the metadata table. Once a sensitive data type is found, the scan immediately enters the data lineage graph and proceeds by depth-first search. Two conditions need to be satisfied during the search: the search continues downward while the relationship of the edge is a copy relationship, and backtracks upward otherwise; and the searched node needs to be connected to the original node, otherwise the search stops.
As shown in fig. 4, when the sequential scan finds that column i in table1 is sensitive data, a depth-first search is immediately performed in the data lineage graph. The search finds that column 1 in table2 and column 1 in table n each have a copy relationship with column i in table1, so column 1 in table2 and column 1 in table n are directly labeled as the same type of sensitive data as column i in table1. This reduces the amount of computation of the scanner's built-in private data identification module and improves identification efficiency.
It should be noted that, when querying the data lineage graph, the same effect can be achieved by traversing the connected graph with a breadth-first search algorithm. In addition, the data relationships may be stored in an ordinary relational database instead of a graph database.
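A breadth-first variant of the lookup can be sketched as follows; it visits nodes in a different order from depth-first search but reaches the same set of fields. The in-memory adjacency list, the node names, and the function name `bfs_related` are assumptions of this example.

```python
from collections import deque


def bfs_related(graph, start, relation="copy"):
    """Breadth-first search from `start` that only follows `relation` edges."""
    visited = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        for neighbor, rel in graph.get(node, ()):
            if rel == relation and neighbor not in visited:
                visited.add(neighbor)
                found.append(neighbor)
                queue.append(neighbor)
    return found


# Hypothetical graph mirroring fig. 4: node -> list of (neighbor, relation).
graph = {
    "table1.col_i": [("table2.col_1", "copy"), ("table_n.col_1", "copy")],
    "table2.col_1": [("table1.col_i", "copy"), ("table2.col_2", "substr")],
    "table_n.col_1": [("table1.col_i", "copy")],
}
print(bfs_related(graph, "table1.col_i"))
```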
According to an embodiment of another aspect, there is also provided an apparatus for identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus being configured to perform the method provided by the embodiments of the present specification. Fig. 5 shows a schematic block diagram of an apparatus for identifying private data in a database according to one embodiment. As shown in fig. 5, the apparatus 500 includes:
a queue forming unit 51, configured to form a queue from the fields in each data table included in the database;
a first identifying unit 52, configured to perform, in the order of the fields in the queue obtained by the queue forming unit 51, a processing operation on each current first field in turn, where the processing operation includes:
under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field;
a searching unit 53, configured to search for a second field having a preset relationship with the first field if the first identification result obtained by the first identifying unit 52 indicates that the first field belongs to private data;
a second identifying unit 54, configured to identify, in a manner corresponding to the preset relationship, whether the second field found by the searching unit 53 belongs to private data to obtain a second identification result, and take the second identification result as the identification result label of the second field.
Optionally, as an embodiment, the queue forming unit 51 is specifically configured to parse the field names of the fields from a metadata table in the database, and sort the field names to form the queue.
Further, the first identifying unit 52 includes:
an acquisition subunit, configured to acquire, from the database, sample data corresponding to the field name of the first field; and
an identification subunit, configured to input the sample data acquired by the acquisition subunit into a private data identification model to obtain the first identification result.
Further, the private data recognition model comprises at least one of the following recognition logic:
regular expressions, language models, verification rules, multi-classification models.
Optionally, as an embodiment, the searchingunit 53 is specifically configured to search, from a data relationship map established in advance, a second field having a preset relationship with the first field; the data relation graph comprises nodes corresponding to the fields, and connecting edges among the nodes correspond to relations among the fields.
Further, the data relationship graph further includes nodes corresponding to the data tables, and the connection edges between the nodes further correspond to the relationship between the data tables and the fields, and the relationship between the data tables and the data tables.
Further, the data relation map is obtained by analyzing a Structured Query Language (SQL) statement corresponding to the database.
Further, the searching unit 53 is specifically configured to start from the node corresponding to the first field, search along connecting edges whose corresponding relation is the preset relation until a connecting edge whose relation is not the preset relation is reached, and use the field corresponding to each node found as the second field.
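The search over the data relationship graph described above can be sketched roughly as follows. The adjacency-list representation, the relation labels ("copy", "truncate", "join"), and the breadth-first order are assumptions of this sketch; the specification does not prescribe a particular graph structure or traversal order.

```python
from collections import deque


def find_related_fields(edges, start, relation):
    """Collect fields reachable from `start` through edges labeled `relation`.

    edges: dict mapping a node to a list of (neighbor, relation_label) pairs.
    The walk stops along any path as soon as an edge carries another label.
    """
    found = []
    seen = {start}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for neighbor, label in edges.get(node, []):
            if label != relation or neighbor in seen:
                continue  # not the preset relation, or already visited
            seen.add(neighbor)
            found.append(neighbor)
            frontier.append(neighbor)
    return found


# Hypothetical graph used only for this example.
edges = {
    "users.phone": [("orders.phone", "copy"), ("users.phone_tail", "truncate")],
    "orders.phone": [("reports.phone", "copy")],
    "reports.phone": [("reports.region", "join")],
}
copies = find_related_fields(edges, "users.phone", "copy")
```

In this example the walk follows the two "copy" edges and stops at `reports.phone`, because the only edge leaving that node carries a different relation.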
Optionally, as an embodiment, the preset relationship is replication;
the second identifying unit 54 is specifically configured to directly determine that the second identification result is that the second field belongs to private data.
Optionally, as an embodiment, the preset relationship is truncation;
the first identifying unit 52 is specifically configured to identify whether the first field belongs to private data by using each identification model in a first identification model set, to obtain respective first identification sub-results, and to determine the first identification result comprehensively from the first identification sub-results;
the second identifying unit 54 is specifically configured to identify whether the second field belongs to private data by using at least one identification model in a second identification model set, to obtain the second identification result; the second identification model set is a subset of the first identification model set.
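The truncation case can be illustrated with the sketch below, in which the second field is checked with only a subset of the models used for the first field. The three toy models, the choice of subset, and the "any model fires" combination rule are hypothetical; the specification says only that the sub-results are combined comprehensively and that the second set is a subset of the first.

```python
import re


# Hypothetical identification models: each maps the sample strings of a
# field to a private-data type name, or None if it does not fire.
def full_phone_model(samples):
    return "phone" if all(re.fullmatch(r"1\d{10}", s) for s in samples) else None


def digits_model(samples):
    return "numeric_id" if all(s.isdigit() for s in samples) else None


def email_model(samples):
    return "email" if all("@" in s for s in samples) else None


FIRST_MODEL_SET = [full_phone_model, digits_model, email_model]
# A truncated value no longer matches full-format patterns, so only part of
# the first set is reused (which part is an assumption of this sketch).
SECOND_MODEL_SET = [digits_model]


def identify(samples, models):
    """Run every model and combine the sub-results into (is_private, type)."""
    sub_results = [model(samples) for model in models]
    hits = [r for r in sub_results if r is not None]
    # Assumed combination rule: private if any model fires; the type
    # reported is that of the first model that fired.
    return (bool(hits), hits[0] if hits else None)


first_result = identify(["13800138000"], FIRST_MODEL_SET)
second_result = identify(["8000"], SECOND_MODEL_SET)
```

Because the second field is evaluated with fewer models, its identification costs less than a full pass over the first model set.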
Optionally, as an embodiment, the first identification result and/or the second identification result includes:
whether the field belongs to private data, and the type of private data when it belongs to private data.
With the apparatus provided in this specification, first, the queue forming unit 51 forms a queue from the fields in each data table included in the database; then, the first identifying unit 52 performs a processing operation on the current first field in turn, according to the order of the fields in the queue, where the processing operation includes: when the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field; then, when the first identification result indicates that the first field belongs to private data, the searching unit 53 searches for a second field having a preset relationship with the first field; finally, the second identifying unit 54 identifies, in a manner corresponding to the preset relationship, whether the second field belongs to private data, to obtain a second identification result, and uses the second identification result as the identification result label of the second field. As can be seen from the above, the embodiments of this specification make use of the relationships between fields: in the process of sequentially identifying whether each field belongs to private data, once a first field belonging to private data is encountered, a second field having a preset relationship with the first field is immediately looked up, and whether the second field belongs to private data is identified in a manner corresponding to the preset relationship. This can improve the efficiency of identifying private data in the database.
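The overall flow summarized above might be sketched as follows. The function names, the use of plain booleans as identification result labels, and the decision not to overwrite a label that a second field already carries are assumptions of this sketch rather than details given in the specification.

```python
def scan_fields(queue, identify_first, find_related, identify_second):
    """Label every field in `queue` as private (True) or not (False).

    identify_first(field) -> bool: full identification of a field.
    find_related(field) -> list of (relation, second_field) pairs.
    identify_second(field, relation) -> bool: relation-specific check.
    """
    labels = {}
    for field in queue:
        if field in labels:
            continue  # already labeled through a related field
        labels[field] = identify_first(field)
        if not labels[field]:
            continue
        for relation, second in find_related(field):
            # Assumption: an existing label is kept rather than overwritten.
            if second not in labels:
                labels[second] = identify_second(second, relation)
    return labels


# Toy stand-ins for the identifying and searching units (hypothetical).
calls = []
related = {"a.phone": [("copy", "b.phone"), ("truncate", "c.tail")]}


def identify_first(field):
    calls.append(field)  # record how often the full identification runs
    return field == "a.phone"


def find_related(field):
    return related.get(field, [])


def identify_second(field, relation):
    # Copies are private by definition; other relations get a cheap check.
    return True if relation == "copy" else field.endswith("tail")


labels = scan_fields(
    ["a.phone", "b.phone", "c.tail", "d.note"],
    identify_first, find_related, identify_second,
)
```

In this toy run the full identification is invoked only for `a.phone` and `d.note`; the two related fields receive their labels through the relation-specific path, which is where the efficiency gain comes from.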
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe in detail the objects, technical solutions, and advantages of the present invention. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.