CN113672653A


Info

Publication number: CN113672653A
Application number: CN202110909377.9A
Authority: CN
Inventors: 刘佳伟; 鲍梦瑶; 章鹏; 张谦; 殷雪梅; 刘新源
Original assignee: Alipay Hangzhou Information Technology Co Ltd; Ant Blockchain Technology Shanghai Co Ltd
Current assignee: Hangzhou Ant Love Technology Co ltd
Priority date: 2021-08-09
Filing date: 2021-08-09
Publication date: 2021-11-19
Anticipated expiration: 2041-08-09
Also published as: CN113672653B

Abstract

Embodiments of this specification provide a method and apparatus for identifying private data in a database. The method includes: forming a queue from the fields of the data tables included in the database; and, following the order of the fields in the queue, performing a processing operation on the current first field in turn. The processing operation includes: when the first field does not have an identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field; if the first identification result indicates that the first field belongs to private data, searching for a second field that has a preset relationship with the first field; and identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data to obtain a second identification result, and using the second identification result as the identification result label of the second field. This can improve the efficiency of identifying private data in the database.

Description

Method and device for identifying private data in database

Technical Field

One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for identifying private data in a database.

Background

Private data, i.e., secret data, refers to information that its owner does not want others or unrelated persons to know. From the perspective of the owner, private data can be divided into individual private data and common private data. Individual private data includes information that can be used to locate or identify an individual (e.g., phone number, address, credit card number) and sensitive information (e.g., personal health, financial information, critical company documents). Common private data mainly concerns family privacy, such as annual household income. The disclosure and abuse of private data is highly likely to cause various personal and public security problems. To protect private data, the fields that belong to private data need to be identified in a database, which typically comprises a large number of data tables, with tens of fields per data table on average.

In the prior art, private data in a database is generally identified by checking, one by one, whether each field of each data table belongs to private data. The performance problem is not obvious when the data volume is small, but with massive data (for example, hundreds of millions of tables and hundreds of millions of fields) it becomes obvious: the table scan cannot be completed within the specified time, which degrades the customer experience.

Accordingly, improved solutions are desired that can improve the efficiency of identifying private data in a database.

Disclosure of Invention

One or more embodiments of the present specification describe a method and apparatus for identifying private data in a database, which can improve the efficiency of identifying private data in the database.

In a first aspect, there is provided a method of identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the method comprising:

forming a queue from the fields in the data tables included in the database;

according to the order of the fields in the queue, sequentially performing a processing operation on the current first field, the processing operation comprising:

when the first field does not have an identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field;

if the first identification result indicates that the first field belongs to private data, searching for a second field having a preset relationship with the first field;

and identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data to obtain a second identification result, and using the second identification result as the identification result label of the second field.

In a possible implementation manner, the forming a queue of each field in each data table included in the database includes:

parsing the field names of the fields from a metadata table in the database, and sorting the field names to form the queue.

Further, the identifying whether the first field belongs to private data includes:

acquiring sample data corresponding to the field name of the first field from the database;

and inputting the sample data into a private data identification model to obtain the first identification result.

Further, the private data recognition model comprises at least one of the following recognition logic:

regular expressions, language models, verification rules, multi-classification models.

In a possible embodiment, the searching for the second field having the preset relationship with the first field includes:

searching for a second field having the preset relationship with the first field in a pre-established data relationship graph; the data relationship graph comprises nodes corresponding to the fields, and the connecting edges between the nodes correspond to the relationships between the fields.

Further, the data relationship graph further includes nodes corresponding to the data tables, and the connecting edges between the nodes further correspond to the relationships between data tables and fields, and the relationships between data tables.

Further, the data relationship graph is obtained by parsing Structured Query Language (SQL) statements corresponding to the database.

Further, the searching for the second field having the preset relationship with the first field in the pre-established data relationship graph includes:

starting from the node corresponding to the first field, searching for nodes whose connecting edges correspond to the preset relationship, until the relationship of a connecting edge is not the preset relationship, and taking the fields corresponding to the nodes found as the second fields.

In one possible embodiment, the preset relationship is copying;

the identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data comprises:

directly determining that the second identification result is that the second field belongs to private data.

In a possible embodiment, the preset relationship is truncation;

the identifying whether the first field belongs to private data comprises:

respectively identifying whether the first field belongs to the private data or not by utilizing each identification model in a first identification model set to obtain each first identification sub-result, and comprehensively determining the first identification result according to each first identification sub-result;

identifying whether the second field belongs to private data or not by using at least one identification model in a second identification model set to obtain a second identification result; the second set of recognition models is a subset of the first set of recognition models.

In a possible embodiment, the first recognition result and/or the second recognition result includes:

whether the field belongs to private data, and the type of private data when it belongs to private data.

In a second aspect, there is provided an apparatus for identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus comprising:

a queue forming unit, configured to form a queue for each field in each data table included in the database;

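a first identifying unit, configured to sequentially perform a processing operation on the current first field according to the order of the fields in the queue obtained by the queue forming unit, where the processing operation includes: when the first field does not have an identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field;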

the searching unit is used for searching a second field having a preset relation with the first field if the first identification result obtained by the first identification unit indicates that the first field belongs to private data;

and the second identification unit is used for identifying whether the second field searched by the search unit belongs to the private data or not by using a mode corresponding to the preset relationship to obtain a second identification result, and the second identification result is used as an identification result label of the second field.

In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.

In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.

According to the method and the device provided by the embodiment of the specification, firstly, each field in each data table included in the database forms a queue; then, according to the sequence of each field in the queue, processing operation is sequentially performed on the current first field, and the processing operation includes: under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field; then if the first identification result indicates that the first field belongs to private data, searching a second field having a preset relationship with the first field; and finally, identifying whether the second field belongs to the private data or not by using a mode corresponding to the preset relation to obtain a second identification result, and taking the second identification result as an identification result label of the second field. As can be seen from the above, in the embodiments of the present specification, by using the relationship between fields, in the process of sequentially identifying whether each field belongs to private data, once a first field belonging to private data is encountered, a second field having a preset relationship with the first field is immediately queried, and whether the second field belongs to private data is identified by using a manner corresponding to the preset relationship.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.

FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;

FIG. 2 illustrates a flow diagram of a method of identifying private data in a database, according to one embodiment;

FIG. 3 illustrates a system architecture diagram for identifying private data in a database, according to one embodiment;

FIG. 4 illustrates a schematic diagram of a fast private data scanning method based on blood relationships (data lineage), according to one embodiment;

FIG. 5 shows a schematic block diagram of an apparatus for identifying private data in a database according to one embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, wherein a field corresponds to a column. Referring to fig. 1, the database includes n data tables, which are respectively denoted as table1, table2, …, and table n, where table1 includes i columns, table2 includes j columns, …, and table n includes k columns, and the database further includes a metadata table in which information of each data table is recorded, for example, information such as a field name and a storage location corresponding to each column in the data table.

Generally, when identifying private data in a database, the metadata table is read first and the information of each data table is obtained from it. Then, based on the underlying database interface, the data of one table is fetched at a time according to that information, and private data identification is performed on each column of the table to judge whether the column belongs to private data. Since there are usually dozens of types of private data, and some private data identification may rely on deep learning models with complex computation, it is difficult to scan the whole database within an acceptable time when the data volume is very large.

According to the embodiment of the specification, the private data is identified by using the relation between the fields, so that the calculation amount can be effectively reduced, and the efficiency of identifying the private data in the database can be greatly improved.

The relationship between fields may be a blood relationship (data lineage). A blood relationship describes the upstream-downstream relationship between data, generally includes copying, truncation, splicing, conversion, and the like, and indicates that the data of one field is obtained by processing the data of another field.

Fig. 2 shows a flow diagram of a method of identifying private data in a database according to one embodiment, the database comprising a plurality of data tables and each data table comprising a plurality of fields. The method may be based on the implementation scenario shown in fig. 1. As shown in fig. 2, the method for identifying private data in a database in this embodiment includes the following steps:

step 21, forming a queue from the fields in the data tables included in the database;

step 22, according to the order of the fields in the queue, sequentially performing a processing operation on the current first field, where the processing operation includes: when the first field does not have an identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field;

step 23, if the first identification result indicates that the first field belongs to private data, searching for a second field having a preset relationship with the first field;

step 24, identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data to obtain a second identification result, and using the second identification result as the identification result label of the second field.

Specific execution modes of the above steps are described below.

First, in step 21, each field in each data table included in the database is placed into a queue. It can be understood that the fields in the queue have a certain order: the fields of all data tables may be arranged in arbitrary order, or the fields of the same data table may be placed at adjacent positions.

In one example, the forming a queue of each field in each data table included in the database includes:

parsing the field names of the fields from a metadata table in the database, and sorting the field names to form the queue.

It is understood that a field name can uniquely identify a field; for example, a Globally Unique Identifier (GUID) is used as the field name, which is specifically in the form of project_name.
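As a rough illustration of step 21, the following sketch builds a queue of field names from metadata rows. The row layout `(project, table, column)`, the dotted naming scheme, and the function name `build_field_queue` are assumptions made for this example; the text does not specify them.

```python
from collections import deque

# Hypothetical metadata rows: (project, table, column). In a real system these
# would be parsed from the database's metadata table.
METADATA_ROWS = [
    ("proj", "table2", "mobile_no"),
    ("proj", "table1", "identity_no"),
    ("proj", "table1", "mobile_no"),
]


def build_field_queue(rows):
    """Return a queue of unique field names, sorted so that fields of the
    same table end up at adjacent positions."""
    names = {f"{project}.{table}.{column}" for project, table, column in rows}
    return deque(sorted(names))


field_queue = build_field_queue(METADATA_ROWS)
print(list(field_queue))
```

Sorting the names is only one way to fix an order; as noted above, any order works as long as each field appears once.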

Then, in step 22, according to the order of the fields in the queue, a processing operation is sequentially performed on the current first field, where the processing operation includes: when the first field does not have an identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field. It will be appreciated that each field needs to be identified only once, and a field that already has an identification result label does not need to be identified again.

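In one example, the identifying whether the first field belongs to private data includes:

acquiring sample data corresponding to the field name of the first field from the database;

inputting the sample data into a private data identification model to obtain the first identification result.

It is understood that the sample data is the column of data corresponding to the field name of the first field, or part of the data in that column. The columns of a data table are shown in table one.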

Table one: data table

	Column 1	Column 2	Column 3
Row 1	Xiao Hong	Female	Age 15
Row 2	Xiao Ming	Male	Age 16
Row 3	Xiao Gang	Male	Age 17
Row 4	Xiao Lan	Female	Age 14

Referring to table one, which is a data table with 4 rows and 3 columns, if the field name of the first field corresponds to column 1, all data in column 1 may be used as sample data, in which case the sample data includes Xiao Hong, Xiao Ming, Xiao Gang, and Xiao Lan; alternatively, part of the data in column 1 may be used as sample data, for example, the sample data includes only Xiao Hong.
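The two sampling options (the whole column, or only part of it) can be sketched as follows. An in-memory SQLite database stands in for the real database; the table name `people`, its column names, and the helper `fetch_sample` are hypothetical.

```python
import sqlite3


def fetch_sample(conn, table, column, limit=None):
    """Return the values of one column; all rows if limit is None."""
    # Identifiers cannot be bound as SQL parameters, so they are quoted here.
    # A real scanner would first validate them against the metadata table.
    sql = f'SELECT "{column}" FROM "{table}"'
    params = ()
    if limit is not None:
        sql += " LIMIT ?"
        params = (limit,)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Xiao Hong", "F", 15), ("Xiao Ming", "M", 16),
     ("Xiao Gang", "M", 17), ("Xiao Lan", "F", 14)],
)

full_sample = fetch_sample(conn, "people", "name")              # whole column
partial_sample = fetch_sample(conn, "people", "name", limit=1)  # part of it
```

Sampling only part of a column trades some identification accuracy for less data transfer; the text leaves that choice open.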

In one example, the private data identification model comprises at least one of the following identification logic: regular expressions, language models, check rules, multi-classification models.

It is understood that a regular expression uses a single character string, written according to an agreed syntax, to describe and match a series of character strings that conform to a certain syntax rule.

A language model is a mathematical model that uses a probability distribution to describe the probability of a given word string or character string.

There may be multiple check rules; for example, private data can be identified by determining whether all the sample data are numeric and whether the number of digits equals a predetermined number.

The multi-classification model can be obtained through machine learning, and may be, for example, a neural network model or a deep learning model.
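A minimal sketch of two of these identification logics, a regular expression and a digit-count check rule, is given below. The pattern (11 digits starting with 1), the 16-digit rule, and the type names `mobile_no` and `card_no` are simplified assumptions chosen for illustration, not rules taken from the text.

```python
import re

# Simplified, illustrative pattern: 11 digits starting with 1.
MOBILE_RE = re.compile(r"1\d{10}")


def regex_rule(samples):
    """Regular-expression logic: every sample fully matches the pattern."""
    return bool(samples) and all(MOBILE_RE.fullmatch(s) for s in samples)


def digit_count_rule(samples, digits):
    """Check rule: all samples are numeric and have a fixed number of digits."""
    return bool(samples) and all(
        s.isascii() and s.isdigit() and len(s) == digits for s in samples
    )


def identify(samples):
    """Return (is_private, private_type); the type names are hypothetical."""
    if regex_rule(samples):
        return True, "mobile_no"
    if digit_count_rule(samples, 16):
        return True, "card_no"
    return False, None


print(identify(["13800000000", "13912345678"]))
print(identify(["Xiao Hong"]))
```

A language model or a multi-classification model would plug into `identify` in the same way, as one more rule that maps sample data to a decision.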

Then, in step 23, if the first identification result indicates that the first field belongs to private data, a second field having a preset relationship with the first field is searched for. It will be appreciated that if the first field belongs to private data, a second field having the preset relationship with the first field either necessarily belongs to private data or belongs to private data with a higher probability.

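In one example, the searching for a second field having a preset relationship with the first field includes:

searching for a second field having the preset relationship with the first field in a pre-established data relationship graph; the data relationship graph comprises nodes corresponding to the fields, and the connecting edges between the nodes correspond to the relationships between the fields.

In the embodiments of the present description, the data relationship graph may exist in the form of a graph database. A graph database is a non-relational database that uses graph theory to store relationship information between entities. Compared with a relational database, a graph database can be queried conveniently and quickly, and can be used for various calculations and reasoning.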

Further, the data relationship graph is obtained by parsing Structured Query Language (SQL) statements corresponding to the database.

SQL parsing is the cornerstone of constructing data blood relationships. It mainly parses the relationships described in SQL between fields and tables, between fields and fields, and between tables and tables. Generally, the relationship between fields may include copy (copy), truncation (substr), concatenation (concat), and the like; the relationship between tables is dependency (depended); the relationship between a field and a table is belonging (belong). A parsed blood relationship may be represented by a triple (source_node, target_node, relation), where source_node is the identifier of the source node, target_node is the identifier of the target node, and relation is the relationship between the nodes. For example, the following SQL:

CREATE TABLE Table1 AS

SELECT identity_no, mobile_no

FROM Table2;

The blood relationships obtained by SQL parsing include:

(Table2.identity_no, Table1.identity_no, copy), indicating that the identity_no field in Table1 is a copy of the identity_no field in Table2;

(Table2.mobile_no, Table1.mobile_no, copy), indicating that the mobile_no field in Table1 is a copy of the mobile_no field in Table2;

(Table2, Table1, depended), indicating that Table1 depends on Table2;

(Table1.identity_no, Table1, belong), indicating that the identity_no field in Table1 belongs to Table1;

(Table1.mobile_no, Table1, belong), indicating that the mobile_no field in Table1 belongs to Table1;

(Table2.mobile_no, Table2, belong), indicating that the mobile_no field in Table2 belongs to Table2;

(Table2.identity_no, Table2, belong), indicating that the identity_no field in Table2 belongs to Table2.

At present, mature third-party libraries are available for SQL parsing, so the principle is not described here.
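The triples above can be loaded into a simple in-memory graph, as in the sketch below. It assumes the SQL has already been parsed by some other component; the adjacency-list layout and the name `build_graph` are choices made for this example, not part of the described method.

```python
from collections import defaultdict

# (source_node, target_node, relation) triples from the example above.
TRIPLES = [
    ("Table2.identity_no", "Table1.identity_no", "copy"),
    ("Table2.mobile_no", "Table1.mobile_no", "copy"),
    ("Table2", "Table1", "depended"),
    ("Table1.identity_no", "Table1", "belong"),
    ("Table1.mobile_no", "Table1", "belong"),
    ("Table2.mobile_no", "Table2", "belong"),
    ("Table2.identity_no", "Table2", "belong"),
]


def build_graph(triples):
    """Map each node to its (neighbour, relation) pairs in both directions,
    so that upstream and downstream nodes can both be reached."""
    graph = defaultdict(list)
    for source, target, relation in triples:
        graph[source].append((target, relation))
        graph[target].append((source, relation))
    return graph


graph = build_graph(TRIPLES)
copies = [node for node, rel in graph["Table1.mobile_no"] if rel == "copy"]
print(copies)
```

Storing each edge in both directions makes it possible to start from either the upstream or the downstream field, at the cost of doubling the edge list.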

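Further, the searching for the second field having the preset relationship with the first field in the pre-established data relationship graph includes:

starting from the node corresponding to the first field, searching for nodes whose connecting edges correspond to the preset relationship, until the relationship of a connecting edge is not the preset relationship, and taking the fields corresponding to the nodes found as the second fields.

In the embodiments of the present specification, the second field may be searched for using depth-first search. Depth-first search starts from a vertex v in the graph and visits v; it then starts in turn from each unvisited neighbour of v and traverses the graph depth-first, until every vertex that has a path to v has been visited; if some vertices in the graph are still unvisited, it starts again from an unvisited vertex, until all vertices in the graph have been visited. It will be appreciated that the vertex v corresponds to the first field.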
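A depth-first search restricted to edges of the preset relationship might look like the sketch below. The graph layout (node mapped to a list of `(neighbour, relation)` pairs), the node names, and the function name `find_related` are assumptions of this example.

```python
def find_related(graph, start, relation="copy"):
    """Iterative depth-first search from `start` that only follows edges whose
    relation equals `relation`; returns the nodes reached, excluding `start`."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push in reverse so neighbours are visited in their listed order.
        for neighbour, rel in reversed(graph.get(node, [])):
            if rel == relation and neighbour not in seen:
                stack.append(neighbour)
    return order[1:]


# Hypothetical graph: three mobile fields linked by copy edges, plus one
# truncated field and one table node that the search must not follow.
graph = {
    "t1.mobile": [("t2.mobile", "copy"), ("t1", "belong")],
    "t2.mobile": [("t1.mobile", "copy"), ("t3.mobile", "copy"),
                  ("t2.prefix", "substr")],
    "t3.mobile": [("t2.mobile", "copy")],
    "t2.prefix": [("t2.mobile", "substr")],
    "t1": [("t1.mobile", "belong")],
}

print(find_related(graph, "t1.mobile"))
```

Because the search stops at any edge that is not the preset relationship, only the connected component of copy edges around the first field is visited.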

Finally, in step 24, whether the second field belongs to private data is identified in a manner corresponding to the preset relationship to obtain a second identification result, and the second identification result is used as the identification result label of the second field. It can be understood that the second field is identified differently from the first field: the preset relationship is taken into account when identifying whether the second field belongs to private data, which can effectively reduce the amount of calculation and greatly improve the efficiency of identifying private data.

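In one example, the preset relationship is copying;

the identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data includes:

directly determining that the second identification result is that the second field belongs to private data.

It can be understood that, if the first field and the second field are in a copy relationship, then given that the first field has already been identified as private data, the second field necessarily belongs to private data and does not need to undergo private data identification, which improves the identification efficiency.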
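For the copy relationship, labelling the second field requires no model call at all. The sketch below assumes labels are kept in a dictionary that maps a field name to an `(is_private, private_type)` pair; that representation and the function name are hypothetical.

```python
def propagate_copy_label(labels, first_field, copied_fields):
    """Give every copied field the label of the first field without running
    any identification model. Existing labels are left untouched."""
    first_label = labels[first_field]
    for field in copied_fields:
        labels.setdefault(field, first_label)
    return labels


labels = {"Table2.mobile_no": (True, "mobile_no")}
propagate_copy_label(labels, "Table2.mobile_no", ["Table1.mobile_no"])
print(labels["Table1.mobile_no"])
```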

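In one example, the preset relationship is truncation;

the identifying whether the first field belongs to private data comprises:

identifying whether the first field belongs to private data by using each identification model in a first identification model set to obtain first identification sub-results, and comprehensively determining the first identification result according to the first identification sub-results;

the identifying, in a manner corresponding to the preset relationship, whether the second field belongs to private data comprises:

identifying whether the second field belongs to private data by using at least one identification model in a second identification model set to obtain the second identification result; the second identification model set is a subset of the first identification model set.

It can be understood that, if the first field and the second field are in a truncation relationship, that is, the second field is a substring of the first field, then given that the first field has already been identified as private data, the second field has a higher probability of belonging to private data. The range of private data identification for the second field can therefore be narrowed, so that the amount of calculation is smaller than that for the first field, which improves the identification efficiency.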
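The truncation case can be sketched as follows: the first field is checked with every model in the first set, while the second field is checked only with a subset. The text does not say how the subset is chosen; selecting it by the type found for the first field (`SECOND_SET_BY_TYPE`) is an assumption of this example, as are the model names and patterns.

```python
import re


def matches(pattern):
    """Build a hypothetical model: True if all samples fully match `pattern`."""
    return lambda xs: bool(xs) and all(re.fullmatch(pattern, x) for x in xs)


# First identification model set (all models run on the first field).
FIRST_MODEL_SET = {
    "mobile_no": matches(r"1\d{10}"),
    "mobile_prefix": matches(r"1\d{2}"),
    "email": matches(r"[^@\s]+@[^@\s]+"),
}

# Assumed mapping: which subset of the first set to try on a truncated field.
SECOND_SET_BY_TYPE = {"mobile_no": ["mobile_prefix"]}


def identify_first(samples):
    """Run every model, then combine the sub-results (first hit wins here)."""
    sub_results = {name: model(samples) for name, model in FIRST_MODEL_SET.items()}
    hits = [name for name, hit in sub_results.items() if hit]
    return (True, hits[0]) if hits else (False, None)


def identify_truncated(samples, first_type):
    """Run only the subset of models associated with the first field's type."""
    for name in SECOND_SET_BY_TYPE.get(first_type, []):
        if FIRST_MODEL_SET[name](samples):
            return True, name
    return False, None


first = identify_first(["13800000000"])
second = identify_truncated(["138"], first[1])
print(first, second)
```

In this toy setup the second field costs one model call instead of three; with dozens of private data types the saving would be correspondingly larger.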

In one example, the first identification result and/or the second identification result includes:

whether the field belongs to private data, and the type of private data when it does.

In the embodiments of the present specification, there are many types of private data, which is one reason why private data identification is generally inefficient. Common types of private data are shown in table two.

Table two: common private data types

Referring to table two, because private data comes in many types, identification is complex and the amount of calculation is generally large.

According to the method provided by the embodiment of the specification, firstly, each field in each data table included in the database forms a queue; then, according to the sequence of each field in the queue, processing operation is sequentially performed on the current first field, and the processing operation includes: under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field; then if the first identification result indicates that the first field belongs to private data, searching a second field having a preset relationship with the first field; and finally, identifying whether the second field belongs to the private data or not by using a mode corresponding to the preset relation to obtain a second identification result, and taking the second identification result as an identification result label of the second field. As can be seen from the above, in the embodiments of the present specification, by using the relationship between fields, in the process of sequentially identifying whether each field belongs to private data, once a first field belonging to private data is encountered, a second field having a preset relationship with the first field is immediately queried, and whether the second field belongs to private data is identified by using a manner corresponding to the preset relationship.
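Putting steps 21 to 24 together for the copy relationship, a minimal end-to-end sketch could look like this. The field names, the graph layout, and the stand-in `recognise` callback are hypothetical; the counter only serves to show how many model calls are avoided.

```python
from collections import deque


def scan(fields, graph, recognise):
    """Label each field once; when a field is private, label the fields
    reachable through copy edges directly, without calling `recognise`."""
    labels = {}
    calls = 0
    queue = deque(fields)
    while queue:
        field = queue.popleft()
        if field in labels:  # already labelled through a relationship
            continue
        labels[field] = recognise(field)
        calls += 1
        if labels[field][0]:
            stack = [field]
            while stack:
                node = stack.pop()
                for neighbour, rel in graph.get(node, []):
                    if rel == "copy" and neighbour not in labels:
                        labels[neighbour] = labels[field]
                        stack.append(neighbour)
    return labels, calls


graph = {
    "t1.mobile": [("t2.mobile", "copy")],
    "t2.mobile": [("t1.mobile", "copy"), ("t3.mobile", "copy")],
    "t3.mobile": [("t2.mobile", "copy")],
}


def recognise(field):
    """Stand-in for the identification model, keyed on the field name only."""
    return (True, "mobile_no") if field.endswith("mobile") else (False, None)


labels, calls = scan(["t1.mobile", "t1.name", "t2.mobile", "t3.mobile"],
                     graph, recognise)
print(calls, labels)
```

Here four fields are labelled with two model calls, because the two copies of `t1.mobile` inherit its label.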

FIG. 3 illustrates a system architecture diagram for identifying private data in a database, according to one embodiment. Referring to fig. 3, the metadata parsing module 31 first reads and parses the relevant data from a metadata table in the database and places the parsing results in a queue. The scanner 32, assuming a single thread is used for processing, reads one element from the queue at a time, fetches the corresponding sample data from the database based on that element, and then performs data identification using the built-in private data identification module. If the identification result is not sensitive data, the scanner continues to read and consume the next element from the queue; if the identification result is sensitive data, the scanner looks up the upstream and downstream elements that have a copy relationship in the data blood relationship graph using a depth-first search algorithm, and then labels the related elements as sensitive data in the database.

The database decoupling unit: client data is stored in different types of databases, such as MySQL and Oracle, and the differences between databases are hidden behind a unified interface.

The built-in private data identification module: stores the various logics for identifying sensitive data types, including regular expressions, language models, check rules, multi-classification models, and the like.

The scanning logic unit: executes the scanning logic, which is a hybrid of linear scanning and tree scanning: scanning is sequential by default and switches to tree scanning when sensitive data is identified.

The data sampling logic: samples data from the database according to the metadata to provide the data basis for subsequent identification by the scanner.
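One way to picture the database decoupling unit is an abstract interface that the scanner depends on, with one adapter per database type. The class and method names below are hypothetical, and an in-memory adapter stands in for real MySQL or Oracle adapters.

```python
from abc import ABC, abstractmethod


class DatabaseAdapter(ABC):
    """Hypothetical unified interface used by the scanner."""

    @abstractmethod
    def list_fields(self):
        """Return (table, column) pairs, as read from the metadata table."""

    @abstractmethod
    def sample(self, table, column, limit):
        """Return up to `limit` values of one column."""


class InMemoryAdapter(DatabaseAdapter):
    """Stand-in backend; a MySQL or Oracle adapter would implement the same
    two methods with its own driver."""

    def __init__(self, tables):
        self._tables = tables  # {table: {column: [values]}}

    def list_fields(self):
        return [(t, c) for t, cols in self._tables.items() for c in cols]

    def sample(self, table, column, limit):
        return self._tables[table][column][:limit]


def scan_samples(adapter, limit=2):
    """Scanner-side code that depends only on the interface."""
    return {f"{t}.{c}": adapter.sample(t, c, limit)
            for t, c in adapter.list_fields()}


adapter = InMemoryAdapter(
    {"table1": {"mobile_no": ["13800000000", "13900000000", "13700000000"]}}
)
samples = scan_samples(adapter)
print(samples)
```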

It is understood that the sensitive data is private data.

FIG. 4 shows a schematic diagram of a fast private data scanning method based on blood relationships according to one embodiment. Referring to fig. 4, in the embodiment of the present disclosure, a sequential scan is used as the cold-start entry: each column of each table is scanned in turn from top to bottom according to the metadata table. Once a column of some sensitive data type is found, the scan immediately enters the data blood relationship graph and proceeds by depth-first search. Two conditions need to be satisfied during the search: if the relationship on an edge is a copy relationship, the search continues downward until the relationship on an edge is not a copy relationship, at which point it backtracks upward; and the searched node must be connected to the original node, otherwise the search stops.

As shown in fig. 4, when column i in table1 is found to be sensitive data during the sequential scan, a depth-first search is immediately performed in the data blood relationship graph. The search finds that column 1 in table2 and column 1 in table n have a copy relationship with column i in table1, so those columns are directly labeled as the same type of sensitive data as column i in table1. This reduces the amount of calculation of the scanner's built-in private data identification module and improves the identification efficiency.

It should be noted that, when querying the data blood relationship graph, the same effect can be achieved by traversing the connected graph with a breadth-first search algorithm. In addition, the data relationships can be stored in a general relational database instead of a graph database.
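As one possible illustration of storing the relationships in a relational database, the sketch below keeps the triples in an SQLite table and finds all fields connected by copy edges with a recursive query. The table name `lineage`, its columns, and the node names are hypothetical, and the recursive common table expression is just one way to traverse the connected fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (source TEXT, target TEXT, relation TEXT)")
conn.executemany(
    "INSERT INTO lineage VALUES (?, ?, ?)",
    [
        ("t1.col_i", "t2.col_1", "copy"),
        ("t2.col_1", "tn.col_1", "copy"),
        ("t1.col_i", "t1", "belong"),
    ],
)


def copies_of(conn, start):
    """Return all fields connected to `start` through copy edges, following
    the edges in both directions."""
    rows = conn.execute(
        """
        WITH RECURSIVE
        edges(a, b) AS (
            SELECT source, target FROM lineage WHERE relation = 'copy'
            UNION
            SELECT target, source FROM lineage WHERE relation = 'copy'
        ),
        reach(node) AS (
            SELECT ?
            UNION
            SELECT edges.b FROM reach JOIN edges ON edges.a = reach.node
        )
        SELECT node FROM reach WHERE node <> ? ORDER BY node
        """,
        (start, start),
    )
    return [row[0] for row in rows]


print(copies_of(conn, "t1.col_i"))
```

Using `UNION` rather than `UNION ALL` in the recursive part discards rows that were already reached, which keeps the query from looping on cyclic copy edges.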

According to an embodiment of another aspect, there is also provided an apparatus for identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus being configured to perform the method provided by the embodiments of the present specification. Fig. 5 shows a schematic block diagram of an apparatus for identifying private data in a database according to one embodiment. As shown in fig. 5, the apparatus 500 includes:

a queue forming unit 51, configured to form a queue from the fields in the data tables included in the database;

a first identifying unit 52, configured to sequentially perform, according to the order of the fields in the queue obtained by the queue forming unit 51, a processing operation on the current first field, where the processing operation includes: when the first field does not have an identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field;

a searching unit 53, configured to search, if the first identification result obtained by the first identifying unit 52 indicates that the first field belongs to private data, for a second field having a preset relationship with the first field;

a second identifying unit 54, configured to identify, in a manner corresponding to the preset relationship, whether the second field found by the searching unit 53 belongs to private data, to obtain a second identification result, and use the second identification result as the identification result label of the second field.

Optionally, as an embodiment, thequeue forming unit 51 is specifically configured to parse a metadata table in the database to obtain field names of the fields, and sort the field names to form the queue.

Further, the first identifying unit 52 includes:

the acquisition subunit is configured to acquire, from the database, sample data corresponding to the field name of the first field;

and the identification subunit is used for inputting the sample data acquired by the acquisition subunit into a privacy data identification model to obtain the first identification result.
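The two subunits can be illustrated with the sketch below. The specification does not define the internals of the privacy data identification model here, so a single regular expression for 11-digit mobile numbers and an 80% match threshold are used purely as assumed stand-ins; the function names and the result dictionary are likewise hypothetical.

```python
import re
import sqlite3

# Assumed rule standing in for the privacy data identification model.
PHONE_PATTERN = re.compile(r"^1[3-9]\d{9}$")


def fetch_samples(conn, table, column, limit=100):
    """Acquisition subunit: read up to `limit` non-null sample values."""
    query = (
        f'SELECT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL LIMIT ?'
    )
    return [str(row[0]) for row in conn.execute(query, (limit,))]


def identify_private(samples, threshold=0.8):
    """Identification subunit: flag the field when enough samples match."""
    if not samples:
        return {"is_private": False, "category": None}
    hits = sum(1 for value in samples if PHONE_PATTERN.match(value))
    if hits / len(samples) >= threshold:
        return {"is_private": True, "category": "phone_number"}
    return {"is_private": False, "category": None}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, phone TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "13800138000"), (2, "13912345678")],
)
phone_result = identify_private(fetch_samples(conn, "users", "phone"))
id_result = identify_private(fetch_samples(conn, "users", "id"))
```

Sampling a bounded number of rows keeps the cost of each identification independent of table size, which is why the number of model invocations dominates the overall scan cost.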

Optionally, as an embodiment, the searching unit 53 is specifically configured to search a pre-established data relationship graph for a second field having a preset relationship with the first field; the data relationship graph comprises nodes corresponding to the fields, and the connecting edges between the nodes correspond to the relationships between the fields.

Further, the data relationship graph is obtained by parsing the Structured Query Language (SQL) statements corresponding to the database.
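One way such parsing might look is sketched below. It is deliberately narrow: it handles only single-column `INSERT INTO ... SELECT ... FROM ...` statements with regular expressions, and it treats a `SUBSTR(...)` call as a truncation relationship, which is an assumption made for illustration. A real system would use a full SQL parser; the function name `parse_lineage` and the relation names are hypothetical.

```python
import re

# Matches: INSERT INTO <table> (<column>) SELECT <expr> FROM <table>
INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(\w+)\s*\((\w+)\)\s*SELECT\s+(.+?)\s+FROM\s+(\w+)",
    re.IGNORECASE,
)
SUBSTR_CALL = re.compile(r"^SUBSTR\(\s*(\w+)\s*,", re.IGNORECASE)


def parse_lineage(statement):
    """Return (source field, target field, relation) or None."""
    match = INSERT_SELECT.search(statement)
    if match is None:
        return None
    dst_table, dst_col, expr, src_table = match.groups()
    substr = SUBSTR_CALL.match(expr)
    if substr:
        # Assumption: a substring of a source column is a truncation.
        src = f"{src_table}.{substr.group(1)}"
        return (src, f"{dst_table}.{dst_col}", "truncate")
    if re.fullmatch(r"\w+", expr):
        # A bare column reference is treated as a replication.
        return (f"{src_table}.{expr}", f"{dst_table}.{dst_col}", "copy")
    return None


copy_edge = parse_lineage(
    "INSERT INTO table2 (col_1) SELECT col_i FROM table1"
)
truncate_edge = parse_lineage(
    "INSERT INTO table3 (prefix) SELECT SUBSTR(col_i, 1, 3) FROM table1"
)
```

Each returned triple corresponds to one connecting edge between two field nodes in the graph.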

Further, the searching unit 53 is specifically configured to search, starting from the node corresponding to the first field, along connecting edges whose relationship is the preset relationship, until a connecting edge whose relationship is not the preset relationship is reached, and to use the fields corresponding to the nodes found as the second field.

Optionally, as an embodiment, the preset relationship is replication;

the second identifying unit 54 is specifically configured to directly determine that the second identification result is that the second field belongs to private data.

Optionally, as an embodiment, the preset relationship is truncation;

the first identifying unit 52 is specifically configured to identify whether the first field belongs to private data by using each identification model in a first identification model set, to obtain respective first identification sub-results, and to comprehensively determine the first identification result according to the first identification sub-results;

the second identifying unit 54 is specifically configured to identify whether the second field belongs to private data by using at least one identification model in a second identification model set, to obtain the second identification result; the second identification model set is a subset of the first identification model set.
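The truncation case can be sketched as below. The three toy models, the majority vote used to combine the first identification sub-results, and the choice of which models form the second set are all assumptions; the specification only requires that the second set be a subset of the first.

```python
import re


def all_digits(samples):
    """Toy model: every sample consists only of digits."""
    return all(s.isdigit() for s in samples)


def mobile_prefix(samples):
    """Toy model: every sample starts like a mobile number (assumed rule)."""
    return all(re.match(r"1[3-9]", s) is not None for s in samples)


def length_eleven(samples):
    """Toy model: every sample is exactly 11 characters long."""
    return all(len(s) == 11 for s in samples)


FIRST_MODEL_SET = [all_digits, mobile_prefix, length_eleven]
# Subset of the first set: the length check is dropped because
# truncation changes the length of the values.
SECOND_MODEL_SET = [all_digits, mobile_prefix]


def identify_first(samples):
    """Combine the sub-results of all models by majority vote (assumed)."""
    if not samples:
        return False
    votes = [model(samples) for model in FIRST_MODEL_SET]
    return sum(votes) > len(votes) / 2


def identify_second(samples):
    """Use only the reduced model set for the truncated field."""
    if not samples:
        return False
    return all(model(samples) for model in SECOND_MODEL_SET)


full_is_private = identify_first(["13800138000", "13912345678"])
truncated_is_private = identify_second(["1380013", "1391234"])
```

Running fewer models on the second field is where the saving comes from in this case: the truncated field is still checked, but at a lower cost than a full identification.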

Optionally, as an embodiment, the first recognition result and/or the second recognition result includes:

With the apparatus provided in the embodiments of this specification, first, the queue forming unit 51 forms a queue from the fields in the data tables included in the database. Then, the first identifying unit 52 performs a processing operation on the current first field in turn, in the order of the fields in the queue, where the processing operation includes: in a case where the first field does not have an identification result tag, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result tag of the first field. Next, when the first identification result indicates that the first field belongs to private data, the searching unit 53 searches for a second field having a preset relationship with the first field. Finally, the second identifying unit 54 identifies, in a manner corresponding to the preset relationship, whether the second field belongs to private data, to obtain a second identification result, and uses the second identification result as the identification result tag of the second field. As can be seen from the above, the embodiments of this specification exploit the relationships between fields: in the process of sequentially identifying whether each field belongs to private data, once a first field belonging to private data is encountered, a second field having a preset relationship with the first field is immediately queried, and whether the second field belongs to private data is identified in a manner corresponding to the preset relationship, which can improve the efficiency of identifying private data in the database.
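The cooperation of the four units can be summarized in the following end-to-end sketch for the replication case. The in-memory dictionaries, the field names, and the toy identification function are assumptions; the counter only serves to show how many fields actually reach the identification model.

```python
from collections import deque

# Hypothetical sample values per field and replication edges between fields.
FIELD_VALUES = {
    "table1.col_i": ["13800138000", "13912345678"],
    "table1.note": ["hello", "world"],
    "table2.col_1": ["13800138000", "13912345678"],
}
COPY_EDGES = {"table1.col_i": ["table2.col_1"]}

model_calls = []


def identify(samples):
    """Toy identification model; records each invocation."""
    model_calls.append(1)
    return all(s.isdigit() and len(s) == 11 for s in samples)


def scan(field_values, copy_edges):
    queue = deque(sorted(field_values))       # queue forming unit
    tags = {}
    while queue:
        first = queue.popleft()
        if first in tags:
            continue                          # already tagged via a relation
        tags[first] = identify(field_values[first])  # first identifying unit
        if not tags[first]:
            continue
        stack = [first]                       # searching unit
        while stack:
            node = stack.pop()
            for second in copy_edges.get(node, []):
                if second not in tags:
                    tags[second] = True       # second identifying unit
                    stack.append(second)
    return tags


tags = scan(FIELD_VALUES, COPY_EDGES)
```

In this example three fields are tagged with only two model invocations, because `table2.col_1` inherits its tag from `table1.col_i` and is skipped when it later reaches the head of the queue.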

According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.

According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above embodiments further describe in detail the objects, technical solutions, and advantages of the present invention. It should be understood that the above embodiments are only exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims

1. A method of identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the method comprising:

2. The method of claim 1, wherein said forming a queue of fields in respective data tables comprised by said database comprises:

3. The method of claim 2, wherein the identifying whether the first field belongs to private data comprises:

4. The method of claim 3, wherein the private data recognition model comprises at least one of the following recognition logic:

5. The method of claim 1, wherein said finding a second field having a preset relationship with said first field comprises:

6. The method of claim 5, wherein the data relationship graph further comprises nodes corresponding to data tables, and the connecting edges between the nodes further correspond to relationships between data tables and fields, and relationships between data tables and data tables.

7. The method of claim 5 or 6, wherein the data relationship graph is derived from parsing a Structured Query Language (SQL) statement corresponding to the database.

8. The method of claim 5, wherein said finding a second field having a preset relationship with said first field from a pre-established data relationship graph comprises:

9. The method of claim 1, wherein the preset relationship is replication;

10. The method of claim 1, wherein the preset relationship is truncation;

the identifying whether the first field belongs to private data comprises:

11. The method of claim 1, wherein the first recognition result and/or the second recognition result comprises:

12. An apparatus for identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus comprising:

13. The apparatus according to claim 12, wherein the queue forming unit is specifically configured to parse a metadata table in the database to obtain field names of the fields, and form the queue after sorting the field names.

14. The apparatus of claim 13, wherein the first identifying unit comprises:

15. The apparatus of claim 14, wherein the private data recognition model comprises at least one of the following recognition logic:

16. The apparatus according to claim 12, wherein the searching unit is specifically configured to search a pre-established data relationship graph for a second field having a preset relationship with the first field; the data relationship graph comprises nodes corresponding to the fields, and the connecting edges between the nodes correspond to the relationships between the fields.

17. The apparatus of claim 16, wherein the data relationship graph further comprises nodes corresponding to data tables, and the connecting edges between the nodes further correspond to relationships between data tables and fields, and relationships between data tables and data tables.

18. The apparatus of claim 16 or 17, wherein the data relationship graph is derived from parsing a Structured Query Language (SQL) statement corresponding to the database.

19. The apparatus according to claim 16, wherein the searching unit is specifically configured to search, starting from the node corresponding to the first field, along connecting edges whose relationship is the preset relationship, until a connecting edge whose relationship is not the preset relationship is reached, and to use the fields corresponding to the nodes found as the second field.

20. The apparatus of claim 12, wherein the preset relationship is replication;

the second identifying unit is specifically configured to directly determine that the second identification result is that the second field belongs to private data.

21. The apparatus of claim 12, wherein the preset relationship is truncation;

the first identifying unit is specifically configured to identify whether the first field belongs to private data by using each identification model in a first identification model set, to obtain respective first identification sub-results, and to comprehensively determine the first identification result according to the first identification sub-results;

the second identifying unit is specifically configured to identify whether the second field belongs to private data by using at least one identification model in a second identification model set, to obtain the second identification result; the second identification model set is a subset of the first identification model set.

22. The apparatus of claim 12, wherein the first recognition result and/or the second recognition result comprises:

23. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.

24. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-11.