CN114595689A


Info

Publication number: CN114595689A
Application number: CN202210212791.9A
Authority: CN
Inventors: 曾壮
Original assignee: Shenzhen Yishi Huolala Technology Co Ltd
Current assignee: Shenzhen Yishi Huolala Technology Co Ltd
Priority date: 2022-02-28
Filing date: 2022-02-28
Publication date: 2022-06-07
Anticipated expiration: 2042-02-28
Also published as: CN114595689B

Abstract

The invention provides a data processing method comprising the following steps: acquiring Chinese remark information corresponding to the field name of data to be processed; performing word segmentation and standardization on the Chinese remark information to obtain a feature vector of the Chinese remark information; inputting the feature vector into a data category recognition model to determine the data category of the data to be processed, wherein the data category recognition model is pre-trained based on a Chinese text classification deep learning algorithm, and the data category is a lowest-level subcategory under a pre-built category catalog having at least two levels; and determining the data sensitivity level of the data to be processed according to the data category. The method is applicable to data classification and grading and to sensitive data management. A data category recognition model built on a Chinese text classification deep learning algorithm identifies data categories accurately, and the data sensitivity level is then determined from the data category, which provides strong technical support for security control of data and for the safe use and sharing of data.

Description

Translated from Chinese

Data processing method, apparatus, storage medium, and computer device

Technical Field

The present invention relates to the field of computer technology, and in particular to a data processing method, an apparatus, a computer-readable storage medium, and a computer device.

Background

With the development of big data technology, enterprises aggregate data of all kinds into a unified data resource pool. Opening data as an asset to different users within the company and to society also increases the risk of sensitive data being leaked or used in violation of regulations. In data sharing and use, it is therefore particularly important to keep sensitive data secure and prevent data leakage. Enterprises now hold massive amounts of data, and traditional unified data management and control can hardly apply fine-grained security control to the hundreds of millions of data fields in an enterprise, so it cannot meet current data security compliance requirements.

In the prior art, sensitive data is identified either by matching keywords defined for data field names or by matching regular expressions constructed for data values that follow a regular pattern. For example, when identifying categories of sensitive data such as ID card numbers, names, mobile phone numbers, and addresses, monitored data is judged to be sensitive once it satisfies the agreed matching conditions. These methods work well when there are few data categories or the data values are uniform. In massive-data scenarios with more complex kinds of data, such as diverse business operation data, user behavior data, and financial statement data, they tend to suffer from rules that are hard to construct, low accuracy, narrow identification coverage, and weak extensibility.

Summary of the Invention

To solve at least one of the above technical defects, the present invention provides a data processing method and a corresponding apparatus, computer-readable storage medium, and computer device according to the following technical solutions.

According to one aspect, an embodiment of the present invention provides a data processing method comprising the following steps:

acquiring Chinese remark information corresponding to the field name of the data to be processed;

performing word segmentation and standardization on the Chinese remark information to obtain a feature vector of the Chinese remark information;

inputting the feature vector into a data category recognition model to determine the data category of the data to be processed, wherein the data category recognition model is pre-trained based on a Chinese text classification deep learning algorithm, and the data category is a lowest-level subcategory under a pre-built category catalog having at least two levels;

determining the data sensitivity level of the data to be processed according to the data category.

Preferably, the data category recognition model is pre-trained through the following steps:

acquiring sample data and data category labels corresponding to the lowest-level subcategories under the category catalog having at least two levels;

performing word segmentation and standardization on the Chinese remark information corresponding to the field names of the sample data to obtain training feature vectors;

training an initial model based on the Chinese text classification deep learning algorithm, using the training feature vectors and the corresponding data category labels, to obtain the data category recognition model.

Preferably, the data category recognition model is a FastText model pre-trained based on the FastText algorithm.

Preferably, the category catalog having at least two levels is pre-built through the following steps:

acquiring a data set, and performing word splitting and word frequency analysis on the field names in the data set to obtain a word frequency analysis result;

determining first-level categories according to the word frequency analysis result;

performing cluster analysis on the data under the first-level categories based on the k-means clustering algorithm to generate the category catalog having at least two levels.

Preferably, performing cluster analysis on the data under the first-level categories based on the k-means clustering algorithm to generate the category catalog having at least two levels comprises:

performing cluster analysis on the data under the first-level categories based on the k-means clustering algorithm to determine second-level subcategories corresponding to the first-level categories;

performing cluster analysis, based on the k-means clustering algorithm, on those second-level subcategories in which the number of data types reaches a preset number, to determine third-level subcategories corresponding to the first-level categories and generate a three-level category catalog.

Preferably, after determining the data sensitivity level of the data to be processed according to the data category, the method further comprises:

performing, according to the data sensitivity level, security control processing on the data to be processed that matches the data sensitivity level.

Preferably, the data sensitivity levels comprise a first level, a second level, a third level, and a fourth level, in order of increasing sensitivity;

performing, according to the data sensitivity level, security control processing on the data to be processed that matches the data sensitivity level comprises:

if the data sensitivity level is the first level, configuring the data to be processed as open to external parties and storing it on a medium for external use;

if the data sensitivity level is the second level, configuring the data to be processed as open only internally and storing it in an internal system;

if the data sensitivity level is the third level, configuring the data to be processed as open only to relevant internal personnel, storing it encrypted in an internal system, transmitting it encrypted, and restricting its output;

if the data sensitivity level is the fourth level, configuring the data to be processed as open only to specific internal personnel, storing it encrypted in an internal system, transmitting it encrypted, and limiting its use to specific business scenarios.
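The four-level control scheme above can be pictured as a lookup table from sensitivity level to control policy. The following is a minimal sketch only; the names `SensitivityLevel`, `ControlPolicy`, `POLICIES`, and `policy_for`, and the wording of the field values, are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from enum import IntEnum


class SensitivityLevel(IntEnum):
    # Ordered from least to most sensitive, as in the text above.
    FIRST = 1
    SECOND = 2
    THIRD = 3
    FOURTH = 4


@dataclass(frozen=True)
class ControlPolicy:
    audience: str             # who may access the data
    storage: str              # where the data is stored
    encrypt_at_rest: bool     # whether encrypted storage is required
    encrypt_in_transit: bool  # whether encrypted transmission is required
    restriction: str          # additional usage restriction, if any


# Hypothetical encoding of the four cases listed above.
POLICIES = {
    SensitivityLevel.FIRST: ControlPolicy(
        "external parties", "external-use medium", False, False, "none"),
    SensitivityLevel.SECOND: ControlPolicy(
        "internal only", "internal system", False, False, "none"),
    SensitivityLevel.THIRD: ControlPolicy(
        "relevant internal personnel", "internal system", True, True,
        "restricted output"),
    SensitivityLevel.FOURTH: ControlPolicy(
        "specific internal personnel", "internal system", True, True,
        "specific business scenarios only"),
}


def policy_for(level: SensitivityLevel) -> ControlPolicy:
    """Return the control policy that matches a data sensitivity level."""
    return POLICIES[level]
```

Keeping the policy in a table rather than in branching code means that a deployment with a different number of levels only has to change the table.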

In addition, according to another aspect, an embodiment of the present invention provides a data processing apparatus comprising:

an information acquisition module, configured to acquire Chinese remark information corresponding to the field name of the data to be processed;

a word segmentation module, configured to perform word segmentation and standardization on the Chinese remark information to obtain a feature vector of the Chinese remark information;

a category recognition module, configured to input the feature vector into a data category recognition model to determine the data category of the data to be processed, wherein the data category recognition model is pre-trained based on a Chinese text classification deep learning algorithm, and the data category is a lowest-level subcategory under a pre-built category catalog having at least two levels;

a sensitivity level determination module, configured to determine the data sensitivity level of the data to be processed according to the data category.

According to yet another aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data processing method described above.

According to still another aspect, an embodiment of the present invention provides a computer device comprising: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to perform the data processing method described above.

Compared with the prior art, the present invention has the following beneficial effects:

The data processing method, apparatus, computer-readable storage medium, and computer device provided by the present invention are applicable to data classification and grading and to sensitive data management. A data category recognition model built on a Chinese text classification deep learning algorithm identifies data categories accurately, and the data sensitivity level is then determined from the data category. Compared with traditional identification methods, which rely on keyword matching rules built on field names and on regular-expression checks of data values, this avoids problems such as rules that are hard to construct, low accuracy, and narrow identification coverage. Using the data category recognition model to extract features from the Chinese remark information enables high-dimensional feature analysis and mining of the data, with high recognition accuracy. In addition, a data category recognition model based on Chinese remark information generalizes well, can accurately identify the data categories of the full data set, and achieves high identification coverage, which provides strong technical support for security control of data and for the safe use and sharing of data.

Additional aspects and advantages of the present invention will be set forth in part in the description that follows, and in part will become apparent from that description or may be learned by practice of the present invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart of a data processing method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a method for building a category catalog having at least two levels, provided by an embodiment of the present invention;

FIG. 3 is a flowchart of a method for training a data category recognition model provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below. Examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the description of the present invention specifies the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The term "and/or" as used herein includes any one of, and all combinations of, one or more of the associated listed items.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in general-purpose dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, should not be interpreted in an idealized or overly formal sense.

An embodiment of the present invention provides a data processing method applicable to data classification and grading and to sensitive data management. Specifically, the data processing method implements efficient classification and grading management of massive data in a big data system, and is a service provision method based on a big data system that combines data permission control, data desensitization, and security auditing. Combining classification and grading of the full data set with data desensitization and permission control services provides effective technical support for data security services, and has achieved good results in actual production use.

As shown in FIG. 1, the data processing method provided by this embodiment of the present invention comprises the following steps:

Step S110: Acquire Chinese remark information corresponding to the field name of the data to be processed.

In this embodiment, the data to be processed is data that is to undergo data classification and grading. Data classification and grading means first assigning the data to the data category it belongs to, and then determining the data sensitivity level of the data according to that category.

When data in a platform or system requires data classification and grading and/or sensitive data management, the Chinese remark information corresponding to the field names of the data to be processed is acquired. The data to be processed is stored in a database and organized into tables, and each column of a table is a field. The field name is the identifier of a column in a table whose data structure follows the relational model. Field names can be defined freely by the database creator and usually consist of English letters, optionally combined with digits and underscores. In this embodiment of the present invention, to make the meaning of the field names clearer and more precise, each field name is configured in advance with Chinese remark information explaining its meaning.

For example, the Chinese remark information corresponding to the field name mobile_telephone is 手机号码 (mobile phone number).
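The patent does not say where the remark information is stored. As one possibility, if the remarks are kept as column comments in MySQL-style table definitions, they could be collected roughly as follows. This is a sketch under that assumption; the DDL text, the table and column names other than mobile_telephone, and the helper `field_remarks` are all hypothetical.

```python
import re

# Hypothetical MySQL-style DDL; only mobile_telephone comes from the text above.
DDL = """
CREATE TABLE user_profile (
  id BIGINT NOT NULL COMMENT '主键',
  mobile_telephone VARCHAR(20) COMMENT '手机号码',
  id_card_no VARCHAR(32) COMMENT '身份证号'
);
"""

# Column name at the start of a line, followed later by COMMENT '...'.
COLUMN_COMMENT = re.compile(r"^\s*`?(\w+)`?\s+.*?COMMENT\s+'([^']*)'", re.IGNORECASE)


def field_remarks(ddl: str) -> dict:
    """Map each field name to its remark text, one column definition per line."""
    remarks = {}
    for line in ddl.splitlines():
        match = COLUMN_COMMENT.match(line)
        if match:
            remarks[match.group(1)] = match.group(2)
    return remarks


remarks = field_remarks(DDL)
```

Lines without a COMMENT clause, such as the CREATE TABLE line itself, are skipped. A production system would more likely query the database's metadata catalog than parse DDL text.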

Step S120: Perform word segmentation and standardization on the Chinese remark information to obtain a feature vector of the Chinese remark information.

In this embodiment, Chinese word segmentation is applied to the Chinese remark information acquired in step S110 to split it into one or more words, and stop words are removed at the same time so that they do not interfere with the subsequent classification result of the model.

The one or more words obtained by segmentation are then standardized to obtain the feature vector of the Chinese remark information, which is used as the input data of the data category recognition model.

In other embodiments, after word segmentation is performed on the Chinese remark information, high-frequency special nouns associated with the business may be added to the segmented words according to the specific business scenario, so that business-related special keywords are not missed when the Chinese remark information is segmented. The one or more words obtained by segmentation and the added high-frequency special nouns are then standardized to obtain the feature vector of the Chinese remark information. Using a feature vector generated after adding the high-frequency special nouns as the input data of the data category recognition model provides strong technical support for improving recognition accuracy.
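The preprocessing in step S120 can be illustrated with a small, self-contained sketch. The patent names neither a segmenter nor a standardization scheme, so the choices here are assumptions: a toy forward-maximum-matching segmenter stands in for a real Chinese segmentation library, business terms are added to the segmenter's dictionary as one reading of "adding high-frequency special nouns", and a bag-of-words count vector stands in for the standardized feature vector. All word lists are hypothetical.

```python
# Hypothetical dictionary, business terms, and stop words for illustration only.
BASE_WORDS = {"用户", "手机", "号码", "收货", "地址", "司机", "姓名"}
DOMAIN_TERMS = {"手机号码", "收货地址"}  # high-frequency business nouns
STOP_WORDS = {"的", "和", "或"}


def segment(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan never stalls.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(remark, use_domain_terms=True):
    """Segment a remark and drop stop words."""
    dictionary = BASE_WORDS | (DOMAIN_TERMS if use_domain_terms else set())
    return [t for t in segment(remark, dictionary) if t not in STOP_WORDS]


def to_vector(tokens, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary order."""
    return [tokens.count(word) for word in vocabulary]


with_terms = preprocess("用户的手机号码")
without_terms = preprocess("用户的手机号码", use_domain_terms=False)
vector = to_vector(with_terms, ["用户", "手机号码", "地址"])
```

With the business terms enabled, 手机号码 survives as a single token. Without them, it is split into 手机 and 号码, which is the kind of missed keyword the paragraph above describes.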

Step S130: Input the feature vector into a data category recognition model to determine the data category of the data to be processed, wherein the data category recognition model is pre-trained based on a Chinese text classification deep learning algorithm, and the data category is a lowest-level subcategory under a pre-built category catalog having at least two levels.

In this embodiment, the data category recognition model is pre-trained with a deep learning algorithm for Chinese text classification.

The Chinese text classification deep learning algorithm includes FastText (an open-source tool for word vector computation and text classification), TextCNN (text convolutional neural network), TextRNN (text recurrent neural network), RCNN (region-based convolutional neural network), HAN (Hierarchical Attention Network), BERT (Bidirectional Encoder Representations from Transformers), and the like, which is not limited in this embodiment of the present invention.

The feature vector of the Chinese remark information is input into the data category recognition model and passes through the model's input layer, intermediate layer, and output layer, which extract features and map them to the corresponding data category label, thereby determining the data category of the data to be processed.
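The input, intermediate, and output layers can be illustrated with a FastText-style forward pass: token embeddings are averaged into a hidden vector, and a linear layer followed by softmax maps it to a label. This is a sketch only, not the trained model of the patent. The embeddings, output weights, and label names are hand-set hypothetical values, and for brevity the function takes tokens rather than a prebuilt feature vector.

```python
import math

# Hypothetical 2-dimensional embeddings and output weights, set by hand.
EMBEDDINGS = {
    "手机号码": [1.0, 0.0],
    "用户": [0.6, 0.2],
    "订单": [0.0, 1.0],
    "金额": [0.1, 0.9],
}
LABELS = ["user_contact_info", "order_finance_info"]
OUTPUT_WEIGHTS = [[2.0, -1.0], [-1.0, 2.0]]  # one row of weights per label


def predict(tokens):
    """Return (label, probability) for a list of tokens, or (None, 0.0) if none are known."""
    # Input layer: look up an embedding for each known token.
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None, 0.0
    # Intermediate layer: average the token embeddings.
    dim = len(vectors[0])
    hidden = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    # Output layer: linear scores followed by a numerically stable softmax.
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in OUTPUT_WEIGHTS]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = predict(["用户", "手机号码"])
```

In practice the embeddings and output weights would be learned from the labeled training feature vectors rather than written by hand.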

Chinese remark information distinguishes categories better than field names do. Using the data category recognition model to extract features from the Chinese remark information enables high-dimensional feature analysis and mining of the data, with high recognition accuracy.

In addition, a data category recognition model based on Chinese remark information generalizes well, can accurately identify the data categories of the full data set, and achieves high identification coverage.

In this embodiment, the category catalog having at least two levels may be a two-level, three-level, four-level, five-level, or even ten-level category catalog. The number of levels in the catalog tree can be customized according to the enterprise's business scenarios and national classification and grading standards, and this embodiment of the present invention does not limit its specific value. Building a category catalog having at least two levels in advance, according to the enterprise's business scenarios and national classification and grading standards, yields a data category catalog that is fine-grained enough to handle application scenarios in which massive data of many kinds must be classified and graded.

In this embodiment, the data category determined for the data to be processed is a lowest-level subcategory under the pre-built category catalog having at least two levels. A lowest-level subcategory is a category that has no further branches beneath it, that is, the lowest level on its branch of the category catalog tree.

For example, if a two-level category catalog is pre-built, it comprises first-level categories and, under those first-level categories that have branches, second-level subcategories. The data category determined for the data to be processed is then either a first-level category without branches or a second-level subcategory under a first-level category with branches.
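The notion of a lowest-level subcategory can be made concrete by representing the catalog as a nested mapping and collecting its leaves. This is a minimal sketch; the catalog contents and the names `CATALOG` and `leaf_categories` are hypothetical.

```python
# Hypothetical two-level catalog: each category maps to its subcategories,
# and an empty dict marks a category with no branches.
CATALOG = {
    "user_basic_info": {"contact_info": {}, "identity_info": {}},
    "device_info": {},
}


def leaf_categories(tree, prefix=()):
    """Return the path of every category that has no further branches."""
    leaves = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            leaves.extend(leaf_categories(children, path))
        else:
            leaves.append("/".join(path))
    return leaves


leaves = leaf_categories(CATALOG)
```

Here device_info is a first-level category without branches and is therefore itself a valid label, while user_basic_info contributes only its two second-level subcategories. The same function works unchanged for deeper catalogs.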

In other embodiments, after the data category of the data to be processed is determined by the data category recognition model, manual verification may also be performed to check the accuracy of the model's data category recognition.

Step S140: Determine the data sensitivity level of the data to be processed according to the data category.

In this embodiment, several data sensitivity levels, ordered by degree of sensitivity, may be defined in advance based on a comprehensive analysis of the importance, utilization value, scope of impact, and other aspects of the data assets within the business scope of the platform or system. There may be two, three, five, eight, or another number of levels, and this embodiment of the present invention does not limit the specific number of data sensitivity levels.

In this embodiment, a corresponding data sensitivity level is set in advance for each lowest-level subcategory under the category catalog having at least two levels. Specifically, in an actual application scenario, the data sensitivity level of each lowest-level subcategory can be set on the basis of the category catalog, taking into account the scope the data category belongs to and the business scenarios involved, with the levels ordered by degree of sensitivity.

In this embodiment, after the data category of the data to be processed is determined by the data category recognition model, the data sensitivity level of the data to be processed is determined according to the preset correspondence between data categories and data sensitivity levels.
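Step S140 amounts to a lookup in the preset correspondence between lowest-level subcategories and sensitivity levels. The following is a minimal sketch; the category names and level values are hypothetical, and raising an error for an unmapped category is a design choice of this sketch rather than something the patent specifies.

```python
# Hypothetical preset correspondence: lowest-level subcategory -> sensitivity level.
SENSITIVITY_BY_CATEGORY = {
    "user_basic_info/contact_info": 3,
    "user_basic_info/identity_info": 4,
    "device_info": 2,
}


def sensitivity_level(category: str) -> int:
    """Look up the preset sensitivity level of a recognized data category."""
    try:
        return SENSITIVITY_BY_CATEGORY[category]
    except KeyError:
        # Fail loudly rather than silently treating unknown data as low-sensitivity.
        raise ValueError(
            f"no sensitivity level configured for category {category!r}"
        ) from None


level = sensitivity_level("user_basic_info/identity_info")
```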

The data processing method provided by the present invention is applicable to data classification and grading and to sensitive data management. A data category recognition model built on a Chinese text classification deep learning algorithm identifies data categories accurately, and the data sensitivity level is then determined from the data category. Compared with traditional identification methods, which rely on keyword matching rules built on field names and on regular-expression checks of data values, this avoids problems such as rules that are hard to construct, low accuracy, and narrow identification coverage. Using the data category recognition model to extract features from the Chinese remark information enables high-dimensional feature analysis and mining of the data, with high recognition accuracy. In addition, a data category recognition model based on Chinese remark information generalizes well, can accurately identify the data categories of the full data set, and achieves high identification coverage, which provides strong technical support for security control of data and for the safe use and sharing of data.

In some embodiments, as shown in FIG. 2, the category catalog having at least two levels is pre-built through the following steps:

Step S210: Acquire a data set, and perform word splitting and word frequency analysis on the field names in the data set to obtain a word frequency analysis result.

In this embodiment, when building a category catalog having at least two levels that relates to the platform's business, field information, including table names, database names, field names, and field remarks, is first randomly sampled from the full data stored in the platform or system to obtain a partial data set. The field names in the data set are split into words, and word frequency analysis is performed to count the frequency of each word, yielding the word frequency analysis result.
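The word splitting and frequency count in step S210 can be sketched with the standard library. The splitting rules here (underscores, camelCase boundaries, digit runs) and the sample field names are assumptions. The patent only states that field names usually consist of English letters, digits, and underscores.

```python
import re
from collections import Counter


def split_field_name(name):
    """Split a field name into lowercase words on separators, case changes, and digit runs."""
    words = []
    for part in re.split(r"[^A-Za-z0-9]+", name):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def word_frequencies(field_names):
    """Count how often each word occurs across all sampled field names."""
    counts = Counter()
    for name in field_names:
        counts.update(split_field_name(name))
    return counts


# Hypothetical sampled field names.
sample = ["user_id", "user_name", "order_id", "mobile_telephone", "userPhone"]
frequencies = word_frequencies(sample)
```

Words at the top of `frequencies.most_common()` are the candidates from which first-level categories would then be derived.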

Step S220: Determine the first-level categories according to the word frequency analysis result.

In this embodiment, important words with high frequency are identified from the word frequency analysis result, which reflects the frequency of each word. The data categories these important words belong to are derived, and the first-level categories are determined according to the enterprise's business scenarios and national classification and grading standards, forming a first-level category catalog.

例如，所确定的一级类别为用户基本信息类、设备信息类、账户信息类、企业财务类等。For example, the determined first-level categories are user basic information category, equipment information category, account information category, enterprise financial category, and the like.

步骤S230：基于k-means聚类算法对所述一级类别下的数据进行聚类分析处理，生成至少两级类别目录。Step S230: Perform cluster analysis processing on the data under the first-level category based on the k-means clustering algorithm to generate at least two-level category catalogs.

对于本实施例，根据形成的一级类别目录，对所述一级类别中每个类别下的数据进行更细颗粒度的划分。具体地，首先对获取的一级类别中每个类别下的数据字段名进行向量化，使用无监督学习k-means聚类算法对向量化的数据进行聚类，细分出多个子类簇，并对每个子类簇下的数据进行标签化，得到二级子类别，生成两级类别目录。For this embodiment, according to the formed first-level category directory, the data under each category in the first-level category is divided into finer granularity. Specifically, firstly, vectorize the data field names under each category in the acquired first-level categories, use the unsupervised learning k-means clustering algorithm to cluster the vectorized data, and subdivide multiple sub-category clusters. And label the data under each sub-category cluster to obtain the second-level sub-category, and generate a two-level category directory.
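The clustering in step S230 can be sketched in plain Python as follows. The embodiment does not say how field names are vectorized, so the bag-of-words `vectorize` helper is an assumption, as are the sample field names; `kmeans` is a minimal textbook k-means (Lloyd's iteration), not the patentee's implementation.

```python
import random


def vectorize(field_names):
    """Bag-of-words vectors over the underscore-separated words (assumed scheme)."""
    vocab = sorted({w for name in field_names for w in name.split("_")})
    return [[1.0 if w in name.split("_") else 0.0 for w in vocab] for name in field_names]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=50, seed=0):
    """Return (cluster index per point, centroids) using Lloyd's iteration."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # Converged: no point changed cluster.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical field names under one first-level category.
fields = ["phone_mobile", "phone_landline", "address_home", "address_work"]
labels, _ = kmeans(vectorize(fields), k=2)
for name, label in zip(fields, labels):
    print(label, name)
```

The resulting clusters depend on the random initialization and on the vectorization, so in practice each sub-cluster would still be reviewed and labeled before it becomes a second-level subcategory.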

In addition, clusters under a subcategory that contain many kinds of data may be subjected to one or more further rounds of cluster analysis based on the k-means clustering algorithm, dividing them further to generate a multi-level category directory.

In this embodiment, the at least two-level category directory may be a two-level, three-level, four-level, five-level, or even ten-level category directory. The number of levels in the directory tree can be customized according to the enterprise's business scenarios and the national classification and grading standards, and the embodiments of the present invention do not limit its specific value.

In this embodiment, when dealing with the full data set, data categories are subdivided through word frequency analysis of field names and the k-means clustering algorithm, and the at least two-level category directory is built according to the enterprise's business scenarios and the national classification and grading standards. This yields a sufficiently fine-grained data category directory that can handle application scenarios requiring classification and grading of massive data of many kinds.

In addition, the data of the lowest-level subcategories under the at least two-level category directory is cleaned to remove noise and interfering data, and the data is tagged with data category labels. This finally produces a complete at least two-level category directory that can serve as a data asset category directory, together with sample data sets corresponding to the lowest-level subcategories.

In some embodiments, three is the preferred number of directory levels, and the at least two-level category directory is specifically a three-level category directory.

In this embodiment, for the three-level category directory, step S230 (performing cluster analysis on the data under the first-level categories based on the k-means clustering algorithm to generate the at least two-level category directory) specifically comprises: performing cluster analysis on the data under the first-level categories based on the k-means clustering algorithm to determine the second-level subcategories corresponding to the first-level categories; and performing cluster analysis, based on the k-means clustering algorithm, on those second-level subcategories whose number of data kinds reaches a preset number, to determine the third-level subcategories corresponding to the first-level categories and generate the three-level category directory.

In this embodiment, the data field names under each first-level category are first vectorized, the unsupervised k-means clustering algorithm clusters the vectorized data into multiple sub-clusters, and the data under each sub-cluster is labeled to obtain second-level subcategories, producing a two-level category directory. Then, the second-level subcategories whose number of data kinds reaches the preset number are clustered again based on the k-means clustering algorithm and thereby divided further to obtain third-level subcategories, producing the three-level category directory.

The preset number can be set to any value greater than 1 according to actual application requirements, and this embodiment does not limit its specific value. For example, second-level subcategories that contain two kinds of data may be clustered again.
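The selection rule based on the preset number can be sketched as follows. The function name `subcategories_to_recluster`, the dictionary layout, and the sample data are hypothetical; the sketch only shows how second-level subcategories whose number of data kinds reaches the preset number might be picked for another round of clustering.

```python
def subcategories_to_recluster(kinds_by_subcategory, preset_number=2):
    """Return the second-level subcategories whose number of data kinds
    reaches the preset number and should therefore be clustered again."""
    if preset_number <= 1:
        raise ValueError("preset_number must be greater than 1")
    return [
        name
        for name, kinds in kinds_by_subcategory.items()
        if len(kinds) >= preset_number
    ]


# Hypothetical second-level subcategories and the kinds of data they hold.
second_level = {
    "contact information": {"mobile phone number", "landline number", "email address"},
    "gender": {"gender"},
}
print(subcategories_to_recluster(second_level, preset_number=2))
# prints ['contact information']
```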

For example, in a three-level category directory, the first-level category of user basic information includes second-level subcategories such as contact information, gender, age, name, and residential address, and the second-level subcategory of contact information includes third-level subcategories such as mobile phone number, landline number, and email address.
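The three-level example above can be represented as a nested mapping, with the lowest-level subcategories (the labels the recognition model predicts) collected by a tree walk. This is a minimal sketch; the nested-dictionary representation and the `leaf_paths` helper are assumptions made for illustration.

```python
# Three-level category directory from the example; an empty dict marks a leaf.
catalog = {
    "user basic information": {
        "contact information": {
            "mobile phone number": {},
            "landline number": {},
            "email address": {},
        },
        "gender": {},
        "age": {},
        "name": {},
        "residential address": {},
    },
}


def leaf_paths(tree, prefix=()):
    """Return the full path of every lowest-level subcategory in the directory."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(path)
    return paths


for path in leaf_paths(catalog):
    print(" / ".join(path))
```

Note that a lowest-level subcategory need not sit at the deepest level: here "gender" is a leaf at the second level, while "mobile phone number" is a leaf at the third.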

In some embodiments, as shown in FIG. 3, the data category recognition model is pre-trained by the following steps:

Step S310: Obtain the sample data and data category labels corresponding to the lowest-level subcategories under the at least two-level category directory.

In this embodiment, the sample data sets corresponding to the lowest-level subcategories are formed when the at least two-level category directory is built. This step therefore builds the data category recognition model from the sample data and data category labels in those sample data sets.

Step S320: Perform word segmentation and standardization on the Chinese remark information corresponding to the field names of the sample data to obtain training feature vectors.

In this embodiment, after the sample data corresponding to the lowest-level subcategories is obtained, Chinese word segmentation is applied to the Chinese remark information of the field names of the sample data, splitting the remark information into one or more words. Stop words are removed at the same time so that they do not interfere with the model training result.

The one or more words obtained by segmentation are then standardized to obtain training feature vectors, which are used as training data for the data category recognition model.

In other embodiments, after word segmentation is performed on the Chinese remark information of the field names of the sample data, high-frequency special nouns associated with the business may be added to the segmented words according to the specific business scenario, so that business-related special keywords are not missed when the remark information is segmented. The segmented words and the added high-frequency special nouns are then standardized to obtain the training feature vectors. Training feature vectors generated with the added high-frequency special nouns, when used as training data, provide strong technical support for training a data category recognition model with high recognition accuracy.
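The segmentation described above can be sketched as follows. The embodiment does not name a segmentation algorithm, so this sketch substitutes simple forward maximum matching against a small dictionary; the dictionary, the stop-word list, the sample remark, and the interpretation of "adding high-frequency special nouns" as extending the segmentation dictionary are all assumptions.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedy forward maximum matching: take the longest dictionary word
    starting at each position, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment_remark(remark, dictionary, stop_words, domain_terms=()):
    """Segment a Chinese field remark and drop stop words.
    domain_terms holds business-specific nouns added to the dictionary."""
    vocab = set(dictionary) | set(domain_terms)
    max_len = max(len(word) for word in vocab)
    return [t for t in forward_max_match(remark, vocab, max_len) if t not in stop_words]


# Hypothetical dictionary, stop words, and field remark.
BASE_DICT = {"司机", "运单", "用户", "手机", "号码"}
STOP_WORDS = {"的"}

print(segment_remark("司机的运单号", BASE_DICT, STOP_WORDS))
# prints ['司机', '运单', '号']
print(segment_remark("司机的运单号", BASE_DICT, STOP_WORDS, domain_terms={"运单号"}))
# prints ['司机', '运单号']
```

Without the business term the remark is split into generic pieces; with it, the business-specific keyword survives as a single token, which is the effect the embodiment attributes to adding high-frequency special nouns.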

Step S330: Train an initial model based on the Chinese text classification deep learning algorithm, using the training feature vectors and the corresponding data category labels, to obtain the data category recognition model.

The initial model of the Chinese text classification deep learning algorithm is trained with the training feature vectors and the corresponding data category labels, the optimal model parameters are solved for, and the data category recognition model is obtained.

In some embodiments, a data set of a newly added data category is added to the sample data set, and the data category recognition model is retrained according to steps S310 to S330 above. This meets the application requirement of iteratively updating category data and significantly improves the extensibility of data category identification.

In this embodiment, in practical applications, business expansion or category refinement increases the number of data categories that the platform or system must identify, so newly added category data needs to be identified iteratively. Retraining the model on the newly added data provides this capability. The at least two-level category directory serving as the data asset directory is updated, sample data is obtained from the newly added category data set and merged into the sample data set above, and the data category recognition model is retrained to obtain a new data category recognition model.

In some embodiments, the Chinese text classification deep learning algorithm is specifically the FastText algorithm, and the data category recognition model is a FastText model pre-trained based on the FastText algorithm. The FastText model has a simple structure, is suited to large data sets, and trains efficiently, so it can handle application scenarios requiring classification and grading of massive data of many kinds.

In some embodiments, inputting the feature vector into the data category recognition model in step S130 to determine the data category of the data to be processed specifically comprises inputting the feature vector into the FastText model to determine the data category of the data to be processed.

In this embodiment, the FastText model includes an input layer, a hidden layer, and a SoftMax layer. The input layer receives the feature vectors of the words and phrases in the Chinese remark information and maps them to the hidden layer units through a linear transformation; the SoftMax layer then maps the result to a data category label, identifying the data category of the data to be processed. Using the FastText model to extract features from the Chinese remark information and determine the data category is fast and accurate, so it can handle application scenarios requiring classification and grading of massive data of many kinds.
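The three-layer structure can be illustrated with a toy forward pass in plain Python. This is a sketch of the general FastText-style architecture (averaged token embeddings, a linear output layer, and softmax), not the trained model of the embodiment; the embeddings, output weights, and labels below are invented values chosen only to make the example run.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tokens, embeddings, output_weights, labels):
    """Average the token embeddings (hidden layer), apply a linear output
    layer, and return the most probable label with all probabilities."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in input")
    hidden = [sum(col) / len(vectors) for col in zip(*vectors)]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy parameters (assumptions): 2-dimensional embeddings and two categories.
EMBEDDINGS = {"手机": [1.0, 0.0], "号码": [0.8, 0.2], "邮箱": [0.0, 1.0]}
OUTPUT_WEIGHTS = [[2.0, -1.0], [-1.0, 2.0]]
LABELS = ["mobile phone number", "email address"]

label, probs = predict(["手机", "号码"], EMBEDDINGS, OUTPUT_WEIGHTS, LABELS)
print(label)
# prints mobile phone number
```

In a trained model the embeddings and output weights are learned in step S330; the forward pass itself stays this simple, which is one reason the architecture trains and predicts quickly.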

In some embodiments, after step S140 determines the data sensitivity level of the data to be processed according to the data category, the method further includes: performing, according to the data sensitivity level, security management and control processing on the data to be processed that matches the data sensitivity level.

In this embodiment, according to the platform's or system's requirements for data security management and control, corresponding security management and control processing is set in advance for each of the defined data sensitivity levels, so that the platform or system can apply different security control measures to data of different sensitivity.

The control items for security management and control processing of data include, but are not limited to: disclosure permission, storage location, whether storage is encrypted, whether transmission is encrypted, and application permission.

In this embodiment, after the data sensitivity level of the data to be processed is determined, the security control measures suitable for the data are determined from the preset correspondence between data sensitivity levels and security control measures, and security management and control processing matching the data sensitivity level is performed on the data to be processed.

This embodiment can be applied to data classification and grading and to sensitive data management. A data category recognition model built on a Chinese text classification deep learning algorithm identifies the data category accurately, the data sensitivity level is then determined from the data category, and security management and control processing matching that level is performed on the data. Because the data category recognition model extracts features from the Chinese remark information, the method enables high-dimensional feature analysis and mining of the data and achieves high recognition accuracy. A model based on Chinese remark information also generalizes well, so it can accurately identify the data categories of the full data set with high coverage, which in turn enables fine-grained security management and control of massive data and the safe use and sharing of data.

In some embodiments, the data sensitivity levels include a first level, a second level, a third level, and a fourth level, in order of increasing sensitivity.

In this embodiment, based on a comprehensive analysis of the importance, exploitation value, scope of impact, and other aspects of the data assets within the business scope of the platform or system, the data is preferably divided into four data sensitivity levels: the first, second, third, and fourth levels, in order of increasing sensitivity.

The following is an example of the correspondence between data sensitivity levels and security control measures:

Data to be processed at the first data sensitivity level is public data with no exploitation value and ordinary importance. It is configured to be open to external parties and is stored on externally accessible media.

Data to be processed at the second data sensitivity level is restricted data with low exploitation value. It is relatively sensitive and should be used only within the enterprise, so it is configured to be open only internally and is stored in the internal systems of the enterprise business system.

Data to be processed at the third data sensitivity level is trade secret data of medium value that can be exploited indirectly. It is relatively important and should be used only by relevant personnel within the enterprise, so it is configured to be open only to relevant internal personnel, stored encrypted in the internal system, transmitted encrypted, and subject to output restrictions.

Data to be processed at the fourth data sensitivity level is core secret data of high value that can be exploited directly. It is extremely critical and should be used only by specific personnel in key departments of the enterprise, so it is configured to be open only to specific internal personnel, stored encrypted in the internal system, transmitted encrypted, and used only in specific business scenarios.
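The example correspondence above can be captured as a lookup table. In this sketch the `ControlPolicy` fields, the `CATEGORY_LEVEL` mapping from data categories to levels, and the `policy_for` helper are hypothetical; the example does not mention encryption for the first and second levels, so `False` there is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlPolicy:
    audience: str             # who may access the data
    storage: str              # where the data is stored
    encrypt_at_rest: bool     # whether storage is encrypted
    encrypt_in_transit: bool  # whether transmission is encrypted
    note: str = ""            # additional restriction, if any


# Levels 1 to 4 follow the example; encryption flags for levels 1 and 2
# are assumed, since the example does not state them.
POLICIES = {
    1: ControlPolicy("external parties", "externally accessible media", False, False),
    2: ControlPolicy("internal only", "internal system", False, False),
    3: ControlPolicy("relevant internal personnel", "internal system", True, True,
                     "output restricted"),
    4: ControlPolicy("specific internal personnel", "internal system", True, True,
                     "specific business scenarios only"),
}

# Hypothetical mapping from lowest-level data category to sensitivity level.
CATEGORY_LEVEL = {"gender": 2, "mobile phone number": 3}


def policy_for(category: str) -> ControlPolicy:
    """Look up the sensitivity level of a category, then its control policy."""
    return POLICIES[CATEGORY_LEVEL[category]]


print(policy_for("mobile phone number"))
```

Keeping the category-to-level mapping separate from the level-to-policy table means a new category only needs a level assignment, while control measures are changed in one place per level.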

In addition, an embodiment of the present invention provides a data processing apparatus. As shown in FIG. 4, the apparatus includes:

an information acquisition module 41, configured to acquire Chinese remark information corresponding to the field name of the data to be processed;

a word segmentation module 42, configured to perform word segmentation and standardization on the Chinese remark information to obtain a feature vector of the Chinese remark information;

a category recognition module 43, configured to input the feature vector into a data category recognition model and determine the data category of the data to be processed, wherein the data category recognition model is pre-trained based on a Chinese text classification deep learning algorithm, and the data category is a lowest-level subcategory under a pre-built at least two-level category directory;

a sensitivity level determination module 44, configured to determine the data sensitivity level of the data to be processed according to the data category.

In some embodiments, the data category recognition model is pre-trained by the following steps:

In some embodiments, the data category recognition model is a FastText model pre-trained based on the FastText algorithm.

In some embodiments, the at least two-level category directory is pre-built by the following steps:

In some embodiments, performing cluster analysis on the data under the first-level categories based on the k-means clustering algorithm to generate the at least two-level category directory includes:

In some embodiments, the data processing apparatus further includes a security management and control module, and the security management and control module is configured to:

after the data sensitivity level of the data to be processed is determined according to the data category, perform, according to the data sensitivity level, security management and control processing on the data to be processed that matches the data sensitivity level.

In some embodiments, the data sensitivity levels include a first level, a second level, a third level, and a fourth level, in order of increasing sensitivity; the security management and control module is specifically configured to:

The content of the method embodiments of the present invention applies to this apparatus embodiment. The functions implemented by this apparatus embodiment are the same as those of the method embodiments above, and the beneficial effects achieved are also the same. For details, refer to the description of the method embodiments, which is not repeated here.

In addition, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the data processing method described in any of the embodiments above. The computer-readable storage medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a storage device includes any medium in which a device (for example, a computer or a mobile phone) stores or transmits information in a readable form, and may be a read-only memory, a magnetic disk, an optical disk, or the like.

The content of the method embodiments of the present invention applies to this storage medium embodiment. The functions implemented by this storage medium embodiment are the same as those of the method embodiments above, and the beneficial effects achieved are also the same. For details, refer to the description of the method embodiments, which is not repeated here.

In addition, an embodiment of the present invention further provides a computer device, which may be a server, a personal computer, a network device, or a similar device. The computer device includes: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to execute the data processing method described in any of the embodiments above.

The content of the method embodiments of the present invention applies to this computer device embodiment. The functions implemented by this computer device embodiment are the same as those of the method embodiments above, and the beneficial effects achieved are also the same. For details, refer to the description of the method embodiments, which is not repeated here.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The above describes only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims


1. A data processing method, characterized by comprising the following steps:

2. The data processing method according to claim 1, wherein the data category recognition model is pre-trained by the following steps:

3. The data processing method according to claim 1, wherein the data category recognition model is a FastText model pre-trained based on the FastText algorithm.

4. The data processing method according to claim 1, wherein the at least two-level category directory is pre-built by the following steps:

5. The data processing method according to claim 4, wherein performing cluster analysis on the data under the first-level categories based on the k-means clustering algorithm to generate the at least two-level category directory comprises:

6. The data processing method according to claim 1, wherein, after determining the data sensitivity level of the data to be processed according to the data category, the method further comprises:

7. The data processing method according to claim 6, wherein the data sensitivity levels comprise a first level, a second level, a third level, and a fourth level, in order of increasing sensitivity;

8. A data processing apparatus, characterized by comprising:

9. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the data processing method according to any one of claims 1 to 7.

10. A computer device, characterized by comprising:

one or more processors;

a memory;

one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to execute the data processing method according to any one of claims 1 to 7.