

TECHNICAL FIELD
The present invention belongs to the field of computer technology, and in particular relates to a multi-source heterogeneous data preprocessing method, apparatus, computer device, and storage medium.
BACKGROUND
Government big data refers to the key private information held inside government agencies, such as audio and video recordings of meetings, personal information of agency personnel, and procurement budget bills. Such data is characterized by a wide range of sources, complex structures, diverse types, content whose meaning is hard to interpret, and partially low credibility. The prior art cannot perform fusion analysis on data of multiple types, including speech data, image data, and video data, which makes modeling and training difficult.
The above content is provided only to assist understanding of the technical solutions of the present invention, and does not constitute an admission that it is prior art.
SUMMARY OF THE INVENTION
The purpose of the present invention is to provide a multi-source heterogeneous data preprocessing method, apparatus, computer device, and storage medium, so as to solve the problem that multiple types of data, including speech data, image data, and video data, are difficult to fuse and analyze, which prevents modeling and training.
The present invention provides a multi-source heterogeneous data preprocessing method. The method preprocesses multi-source heterogeneous data based on a blockchain and word vectors, and includes: grouping the multi-source heterogeneous data based on data categories; acquiring text information corresponding to the multi-source heterogeneous data; acquiring character vectors corresponding to the text information to form a character vector set; fusing the character vector set into a one-dimensional vector; and performing federated learning based on the one-dimensional vector to jointly build a model.
In some embodiments, grouping the multi-source heterogeneous data based on data categories includes: dividing the multi-source heterogeneous data of the same data category into one group by a classifier, the data categories including speech data, image data, and video data.
In some embodiments, acquiring the text information corresponding to the multi-source heterogeneous data includes: feeding the multi-source heterogeneous data into a corresponding model to obtain the corresponding text information.
In some embodiments, acquiring the character vectors corresponding to the text information to form a character vector set includes: acquiring the character vector corresponding to each character in the text information, the character vectors corresponding to each piece of text forming the character vector set.
In some embodiments, after the multi-source heterogeneous data is grouped based on data categories, the method further includes: within each group of the multi-source heterogeneous data, removing redundancy according to attributes of the multi-source heterogeneous data. The attributes of the multi-source heterogeneous data include data name, data size, and data keywords. Within a group, if multiple pieces of data have the same name, size, and content, only one is retained; if multiple pieces of data have the same keywords, 5% to 15% of the data is randomly retained; if multiple pieces of data have the same name and size but different content, the data names are modified.
In some embodiments, after the redundancy is removed from each group of the multi-source heterogeneous data according to the attributes of the data, the method further includes: scoring each group of the de-redundant multi-source heterogeneous data according to importance, and uploading the scores and the storage locations of the corresponding data to a blockchain.
Each occurrence of a piece of the multi-source heterogeneous data, or each occurrence of one of its keywords, adds a first score; if the piece of data is favorited, a second score is added.
Matching the above method, another aspect of the present invention provides a multi-source heterogeneous data preprocessing apparatus. The apparatus preprocesses multi-source heterogeneous data based on a blockchain and word vectors, and includes: a grouping unit configured to group the multi-source heterogeneous data based on data categories; an acquisition unit configured to acquire the text information corresponding to the multi-source heterogeneous data and to acquire the character vectors corresponding to the text information to form a character vector set; a fusion unit configured to fuse the character vector set into a one-dimensional vector; and a modeling unit configured to perform federated learning based on the one-dimensional vector to jointly build a model.
In some embodiments, the grouping unit includes a classifier, and the classifier divides the multi-source heterogeneous data of the same data category into one group; the data categories include speech data, image data, and video data.
In some embodiments, acquiring the text information corresponding to the multi-source heterogeneous data and acquiring the character vectors corresponding to the text information to form a character vector set includes: feeding the multi-source heterogeneous data into a corresponding model to obtain the corresponding text information, and acquiring the character vector corresponding to each character in the text information, the character vectors corresponding to each piece of text forming the character vector set.
In some embodiments, the multi-source heterogeneous data preprocessing apparatus further includes a de-redundancy unit configured to remove redundancy within each group of the multi-source heterogeneous data according to the attributes of the data. The attributes of the multi-source heterogeneous data include data name, data size, and data keywords. Within a group, if multiple pieces of data have the same name, size, and content, only one is retained; if multiple pieces of data have the same keywords, 5% to 15% of the data is randomly retained; if multiple pieces of data have the same name and size but different content, the data names are modified.
In some embodiments, the multi-source heterogeneous data preprocessing apparatus further includes a scoring unit configured to score each group of the de-redundant multi-source heterogeneous data according to importance, and to upload the scores and the storage locations of the corresponding data to the blockchain. Each occurrence of a piece of the multi-source heterogeneous data, or each occurrence of one of its keywords, adds a first score; if the piece of data is favorited, a second score is added.
Matching the above apparatus, yet another aspect of the present invention provides a computer device including the above multi-source heterogeneous data preprocessing apparatus.
Matching the above method, yet another aspect of the present invention provides a storage medium. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the above multi-source heterogeneous data preprocessing method.
Thus, in the solution of the present invention, grouping the multi-source heterogeneous data based on data categories, acquiring the text information corresponding to the data, and acquiring the character vectors corresponding to the text information to form a character vector set solves the problem that multiple types of data are difficult to fuse and analyze. Fusing the character vector set into a one-dimensional vector and performing federated learning based on the one-dimensional vector to jointly build a model can simplify the workflow of government personnel and improve work quality and efficiency.
By scoring each group of the de-redundant multi-source heterogeneous data according to importance, and uploading the scores and the storage locations of the corresponding data to the blockchain, trusted traceability of important data is guaranteed, and government data is kept secure and tamper-proof.
Other features and advantages of the present invention will be set forth in the description that follows and, in part, will become apparent from the description or may be learned by practicing the invention.
The technical solutions of the present invention are described in further detail below with reference to the accompanying drawings and embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of an embodiment of the multi-source heterogeneous data preprocessing method of the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of the multi-source heterogeneous data preprocessing apparatus of the present invention.
With reference to the accompanying drawings, the reference numerals in the embodiments of the present invention are as follows:
101: grouping unit; 102: de-redundancy unit; 103: scoring unit; 104: acquisition unit; 105: fusion unit; 106: modeling unit.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
According to an embodiment of the present invention, a multi-source heterogeneous data preprocessing method is provided. FIG. 1 is a schematic flowchart of an embodiment of the method. The method preprocesses multi-source heterogeneous data based on a blockchain and word vectors, and may include steps S101, S201, S301, S401, and S501.
In step S101, the multi-source heterogeneous data is grouped based on data categories.
In some embodiments, in step S101, a classifier divides the multi-source heterogeneous data of the same data category into one group; the data categories include speech data, image data, and video data.
As an implementation, the speech data is grouped together, the image data is grouped together, and the video data is grouped together.
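The grouping step can be sketched as follows. The patent does not specify the classifier, so a minimal suffix-based classifier over hypothetical file names is assumed here purely for illustration:

```python
from collections import defaultdict

# Hypothetical suffix-to-category rules; a real classifier could instead
# inspect MIME types or file contents.
CATEGORY_BY_SUFFIX = {
    ".wav": "speech", ".mp3": "speech",
    ".jpg": "image", ".png": "image",
    ".mp4": "video", ".avi": "video",
}

def classify(filename: str) -> str:
    """Return the data category for a file name (default: 'other')."""
    for suffix, category in CATEGORY_BY_SUFFIX.items():
        if filename.endswith(suffix):
            return category
    return "other"

def group_by_category(filenames):
    """Group file names so that each data category forms one group."""
    groups = defaultdict(list)
    for name in filenames:
        groups[classify(name)].append(name)
    return dict(groups)

groups = group_by_category(["a.wav", "b.jpg", "c.mp4", "d.mp3"])
# speech: a.wav, d.mp3; image: b.jpg; video: c.mp4
```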
In step S201, the text information corresponding to the multi-source heterogeneous data is acquired.
In some embodiments, in step S201, the multi-source heterogeneous data is fed into a corresponding model to obtain the corresponding text information.
As an implementation, the speech data is fed into a speech recognition model and converted into corresponding text information, the image data is fed into an image recognition model and converted into corresponding text information, and the video data is fed into a video recognition model and converted into corresponding text information. Because text information is easy to process, data of all forms is uniformly converted into text.
The image data and the video data are fed into their corresponding models frame by frame to obtain the corresponding text information.
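The per-modality conversion can be sketched as a dispatch table. The concrete recognition models are not named in the text, so trivial placeholder recognizers are assumed here; the frame-by-frame handling of image and video data follows the description above:

```python
# Placeholder recognizers standing in for the speech/image/video models,
# which the patent does not name concretely.
def speech_to_text(item):
    return f"transcript of {item}"

def frames_to_text(frames):
    # Image and video data are processed frame by frame (a single image
    # counts as one frame) and the per-frame captions are joined.
    return " ".join(f"caption of {frame}" for frame in frames)

RECOGNIZERS = {
    "speech": speech_to_text,
    "image": lambda item: frames_to_text([item]),   # one frame
    "video": lambda item: frames_to_text(item),     # item: list of frames
}

def to_text(category, item):
    """Convert one data item of the given category into text information."""
    return RECOGNIZERS[category](item)
```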
In step S301, the character vectors corresponding to the text information are acquired to form a character vector set.
In some embodiments, in step S301, the character vector corresponding to each character in the text information is acquired, and the character vectors corresponding to each piece of text form the character vector set.
As an implementation, although the multi-source heterogeneous data has been converted into text information, text is inconvenient to feed into a training network; the text must first be preprocessed into numeric vectors. After the conversion, each character of each sentence in the text information is looked up in an existing Chinese character vector table to obtain the corresponding character vector, so each sentence corresponds to one character vector set. Using an embedding method, each Chinese character is mapped to one character vector; the existing character vector table serves as the reference for converting every character in the text into its character vector. Thus every sentence yields one character vector set, each character in the set corresponds to one character vector, and the mapping between characters and vectors can be looked up in the character vector table.
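The character-to-vector lookup can be sketched as follows, assuming a pre-built Chinese character vector table (here a toy 4-dimensional random table; in practice a table of pre-trained character embeddings would be loaded from disk):

```python
import numpy as np

# Toy character vector table: every known character maps to one fixed
# 4-dimensional vector. A real table would cover the full character set.
rng = np.random.default_rng(0)
CHAR_VECTOR_TABLE = {ch: rng.standard_normal(4) for ch in "政务数据分析"}

def sentence_to_vectors(sentence):
    """Look up the character vector of every character in a sentence."""
    return np.stack([CHAR_VECTOR_TABLE[ch] for ch in sentence])

vectors = sentence_to_vectors("政务数据")
# vectors has shape (4, 4): one 4-dimensional vector per character
```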
In step S401, the character vector set is fused into a one-dimensional vector.
In some embodiments, in step S401, each character vector set is fused into a one-dimensional vector. Because sentences differ in length, the vector groups obtained in step S301 differ in dimension, and data of different dimensions is inconvenient to feed into a model; the character vector set must therefore be reduced to a one-dimensional vector. Since the row vectors of every character vector set have the same dimension while the number of rows differs, the corresponding entries of all rows are added together, that is, the character vectors are summed element-wise, yielding a one-dimensional vector whose dimension equals that of a single row vector, so all vectors share a unified dimension.
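The fusion step can be sketched as follows: each sentence produces a (num_chars, embed_dim) matrix, and summing over the character axis yields a fixed-length one-dimensional vector regardless of sentence length:

```python
import numpy as np

def fuse_to_1d(char_vectors: np.ndarray) -> np.ndarray:
    """Collapse a (num_chars, embed_dim) matrix into one embed_dim vector
    by summing the character vectors element-wise."""
    return char_vectors.sum(axis=0)

short = np.ones((3, 4))    # 3-character sentence, 4-dim embeddings
long = np.ones((10, 4))    # 10-character sentence, same embedding dim
# Both sentences now map to vectors of the same, unified dimension.
assert fuse_to_1d(short).shape == fuse_to_1d(long).shape == (4,)
```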
In step S501, federated learning is performed based on the one-dimensional vector to jointly build a model.
In some embodiments, in step S501, the departments jointly build a model through federated learning. Once each department has unified its internal data, the model can be built collaboratively in a federated manner. First, the key government department, i.e., the competent authority, agrees on the model structure, the participating sub-departments, the aggregating department, and so on. In each round of federated training, the participating departments download the latest global model and train locally on the fused one-dimensional vectors. After local training, each participating department uploads all of its model parameters to the aggregating department; once the aggregating department has received the parameters from all participants, it aggregates them by computing their mean. After aggregation, the parameters are saved as the global model. When the next round of training starts, the participating departments download the global model again for local training. Training loops until the accuracy of the global model meets the requirements of the key government department, after which training stops, the global model is deployed to the devices that need it, and the federated co-built model is complete.
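The aggregation-by-mean round described above can be sketched in a FedAvg style. The model and the local optimizer are not fixed by the text, so a placeholder local update over toy department data is assumed:

```python
import numpy as np

def local_train(global_params, local_data):
    # Placeholder local update: nudge the parameters toward the mean of
    # the department's fused vectors. A real department would run its
    # agreed model and optimizer here.
    return global_params + 0.1 * (local_data.mean(axis=0) - global_params)

def aggregate(param_list):
    """Aggregate the uploaded parameters by computing their mean."""
    return np.mean(param_list, axis=0)

global_params = np.zeros(4)
department_data = [np.ones((5, 4)), 3 * np.ones((5, 4))]  # two departments
for _ in range(10):  # federated rounds
    uploads = [local_train(global_params, d) for d in department_data]
    global_params = aggregate(uploads)
# global_params drifts toward the mean of the departments' data means (2.0)
```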
After the multi-source heterogeneous data is grouped based on data categories, and before the corresponding text information is acquired, the multi-source heterogeneous data preprocessing method further includes step S102: within each group of the multi-source heterogeneous data, removing redundancy according to the attributes of the multi-source heterogeneous data; that is, redundancy is removed based on data attributes. The attributes of the multi-source heterogeneous data include data name, data size, and data keywords. Within a group, if multiple pieces of data have the same name, size, and content, only one is retained; if multiple pieces of data have the same keywords, 5% to 15% of the data is randomly retained; if multiple pieces of data have the same name and size but different content, the data names are modified.
In this step, each department internally removes redundancy from the data within each group of the multi-source heterogeneous data according to attributes such as data name, data size, and data keywords. For example, the speech data in the speech group is de-redundant according to the speech data name, size, and keywords; the image data in the image group according to the image data name, size, and keywords; and the video data in the video group according to the video data name, size, and keywords. If multiple pieces of data have the same name, size, and content, only one is retained; if multiple pieces of data have the same keywords, 10% of the data is randomly retained; if multiple pieces of data have the same name and size but different content, no data is removed and only the names are modified.
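The three de-redundancy rules above can be sketched as follows. The record schema (name, size, content, and keywords as a single string) is an assumption, since the text does not fix one; the 10% retention ratio from this step is used:

```python
import random
from collections import defaultdict

def remove_redundancy(records, keep_ratio=0.10, seed=0):
    rng = random.Random(seed)
    # Rule 1: identical name, size, and content -> keep only one copy.
    seen_exact, deduped = set(), []
    for r in records:
        key = (r["name"], r["size"], r["content"])
        if key not in seen_exact:
            seen_exact.add(key)
            deduped.append(dict(r))
    # Rule 3: same name and size but different content -> rename, keep all.
    seen_name_size = {}
    for r in deduped:
        key = (r["name"], r["size"])
        count = seen_name_size.get(key, 0)
        seen_name_size[key] = count + 1
        if count:
            r["name"] = f"{r['name']}_{count}"
    # Rule 2: identical keywords -> randomly keep ~keep_ratio of the group.
    by_keywords = defaultdict(list)
    for r in deduped:
        by_keywords[r["keywords"]].append(r)
    kept = []
    for group in by_keywords.values():
        if len(group) > 1:
            k = max(1, round(keep_ratio * len(group)))
            kept.extend(rng.sample(group, k))
        else:
            kept.extend(group)
    return kept
```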
After the redundancy is removed from each group of the multi-source heterogeneous data according to the attributes of the data, and before the corresponding text information is acquired, the multi-source heterogeneous data preprocessing method further includes step S103: scoring each group of the de-redundant multi-source heterogeneous data according to importance, and uploading the scores and the storage locations of the corresponding data to the blockchain; that is, the data is scored according to importance, and the scores and corresponding storage locations are uploaded to the blockchain. Each occurrence of a piece of the multi-source heterogeneous data, or each occurrence of one of its keywords, adds a first score; if the piece of data is favorited, a second score is added.
In this step, the de-redundant data is scored according to importance, and the scores and data storage locations are uploaded to the blockchain. The importance of the data can be observed during de-redundancy: data that is repeated more often is more important; if certain keywords appear frequently, the data involving those keywords is more important; and favorited data is more important. The importance score is an integer; for each piece of data, the number of occurrences of the data, the number of occurrences of its keywords, and whether it is favorited are counted. Each occurrence of the data adds the first score, each occurrence of one of its keywords adds the first score, and being favorited adds the second score. The sum of all importance scores is the final score of the piece of data. The scores and data storage locations are uploaded to the blockchain for traceability.
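The score-and-location upload can be sketched with a toy append-only hash chain standing in for the blockchain; a real deployment would call a blockchain client instead. Tampering with a stored score breaks verification, which is the tamper-evidence this step relies on:

```python
import hashlib
import json

class Ledger:
    """Toy append-only hash chain: each entry records the score and the
    storage location of one piece of data, linked to the previous entry."""

    def __init__(self):
        self.blocks = []

    def _digest(self, payload):
        return hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def append(self, score, storage_location):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        payload = {"score": score, "location": storage_location,
                   "prev": prev_hash}
        self.blocks.append({**payload, "hash": self._digest(payload)})

    def verify(self):
        """Recompute every hash; any modified entry breaks the chain."""
        prev = "0" * 64
        for b in self.blocks:
            payload = {k: b[k] for k in ("score", "location", "prev")}
            if b["prev"] != prev or b["hash"] != self._digest(payload):
                return False
            prev = b["hash"]
        return True
```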
In this step, the first score is one point and the second score is five points.
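With the first score fixed at one point and the second at five, the scoring rule can be sketched as:

```python
# Scoring rule: one point per occurrence of the data, one point per
# occurrence of one of its keywords, five points if it is favorited.
FIRST_SCORE, SECOND_SCORE = 1, 5

def importance_score(occurrences, keyword_occurrences, favorited):
    score = FIRST_SCORE * occurrences
    score += FIRST_SCORE * keyword_occurrences
    if favorited:
        score += SECOND_SCORE
    return score

# A piece of data seen 3 times, whose keywords appeared 7 times, favorited:
importance_score(3, 7, True)  # 3 + 7 + 5 = 15
```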
After this step, the de-redundant multi-source heterogeneous data is fed into the corresponding model to obtain the corresponding text information.
In step S501, since each piece of data has its own importance score, a piece of data whose importance score is N is used for N rounds of local training.
According to an embodiment of the present invention, a multi-source heterogeneous data preprocessing apparatus corresponding to the above method is also provided. FIG. 2 is a schematic structural diagram of an embodiment of the apparatus. The apparatus preprocesses multi-source heterogeneous data based on a blockchain and word vectors, and includes: a grouping unit 101, an acquisition unit 104, a fusion unit 105, and a modeling unit 106.
The grouping unit 101 is configured to group the multi-source heterogeneous data based on data categories.
In some embodiments, the grouping unit 101 includes a classifier, and the classifier divides the multi-source heterogeneous data of the same data category into one group; the data categories include speech data, image data, and video data.
As an implementation, the classifier groups the speech data together, the image data together, and the video data together.
The acquisition unit 104 is configured to acquire the text information corresponding to the multi-source heterogeneous data, and to acquire the character vectors corresponding to the text information to form a character vector set.
In some embodiments, the multi-source heterogeneous data is fed into the corresponding model to obtain the corresponding text information; the character vector corresponding to each character in the text information is acquired, and the character vectors corresponding to each piece of text form the character vector set.
As an implementation, the speech data is fed into a speech recognition model and converted into corresponding text information, the image data is fed into an image recognition model and converted into corresponding text information, and the video data is fed into a video recognition model and converted into corresponding text information. Because text information is easy to process, data of all forms is uniformly converted into text; the image data and the video data are fed into their corresponding models frame by frame. Although the multi-source heterogeneous data has been converted into text information, text is inconvenient to feed into a training network; the text must first be preprocessed into numeric vectors. After the conversion, each character of each sentence in the text information is looked up in an existing Chinese character vector table to obtain the corresponding character vector, so each sentence corresponds to one character vector set. Using an embedding method, each Chinese character is mapped to one character vector; the existing character vector table serves as the reference for converting every character in the text into its character vector. Thus every sentence yields one character vector set, each character in the set corresponds to one character vector, and the mapping between characters and vectors can be looked up in the character vector table.
The fusion unit 105 is configured to fuse the character vector set into a one-dimensional vector.
In some embodiments, the fusion unit 105 fuses each character vector set into a one-dimensional vector. Because sentences differ in length, the vector groups obtained by the acquisition unit 104 differ in dimension, and data of different dimensions is inconvenient to feed into a model; the character vector set must therefore be reduced to a one-dimensional vector. Since the row vectors of every character vector set have the same dimension while the number of rows differs, the corresponding entries of all rows are added together, that is, the character vectors are summed element-wise, yielding a one-dimensional vector whose dimension equals that of a single row vector, so all vectors share a unified dimension.
The modeling unit 106 is configured to perform federated learning based on the one-dimensional vector to jointly build the model.
In some embodiments, the modeling unit 106 enables the departments to jointly build a model through federated learning. Once each department has unified its internal data, the model can be built collaboratively in a federated manner. First, the key government department, i.e., the competent authority, agrees on the model structure, the participating sub-departments, the aggregating department, and so on. In each round of federated training, the participating departments download the latest global model and train locally on the fused one-dimensional vectors. After local training, each participating department uploads all of its model parameters to the aggregating department; once the aggregating department has received the parameters from all participants, it aggregates them by computing their mean. After aggregation, the parameters are saved as the global model. When the next round of training starts, the participating departments download the global model again for local training. Training loops until the accuracy of the global model meets the requirements of the key government department, after which training stops, the global model is deployed to the devices that need it, and the federated co-built model is complete.
The multi-source heterogeneous data preprocessing apparatus further includes a de-redundancy unit 102, located after the grouping unit 101 and before the obtaining unit 104. The de-redundancy unit 102 is configured to remove redundancy within each group of multi-source heterogeneous data according to the data's attributes, which include data name, data size, and data keywords. Within a group, if multiple pieces of data share the same name, size, and content, only one is kept; if multiple pieces of data share the same keywords, 5% to 15% of them are randomly retained; if multiple pieces of data share the same name and size but differ in content, the data names are modified instead.
Through the de-redundancy unit 102, each department removes redundancy from the data within each group according to attributes such as data name, data size, and data keywords. For example, the voice data in the voice group are de-duplicated according to voice data name, size, and keywords; the image data in the image group according to image data name, size, and keywords; and the video data in the video group according to video data name, size, and keywords. If multiple pieces of data share the same name, size, and content, one is kept; if multiple pieces share the same keywords, 10% of them are randomly retained; if multiple pieces share the same name and size but differ in content, no redundancy is removed and only the names are modified.
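The three de-redundancy rules can be sketched as follows. The record fields (`name`, `size`, `content`, `keywords`) and the function structure are illustrative assumptions; the 10% retention ratio is the value used in this embodiment:

```python
import random

def deduplicate(group, keep_ratio=0.10):
    """Apply the de-redundancy rules within one data group.
    Each record is a dict with 'name', 'size', 'content', 'keywords'."""
    # Rule 1: identical name, size and content -> keep a single copy.
    seen, unique = set(), []
    for rec in group:
        key = (rec["name"], rec["size"], rec["content"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # Rule 3: same name and size but different content -> rename, keep all.
    by_name_size = {}
    for rec in unique:
        by_name_size.setdefault((rec["name"], rec["size"]), []).append(rec)
    for clones in by_name_size.values():
        for i, rec in enumerate(clones[1:], start=1):
            rec["name"] = f'{rec["name"]}_{i}'

    # Rule 2: identical keywords -> randomly retain keep_ratio of them.
    by_keywords = {}
    for rec in unique:
        by_keywords.setdefault(tuple(rec["keywords"]), []).append(rec)
    result = []
    for clones in by_keywords.values():
        if len(clones) > 1:
            keep = max(1, round(len(clones) * keep_ratio))
            result.extend(random.sample(clones, keep))
        else:
            result.append(clones[0])
    return result
```

For instance, three records named "a" of size 1 where two share content "x" collapse to two records, the second of which is renamed "a_1" because it shares a name and size but not content with the first.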
The multi-source heterogeneous data preprocessing apparatus further includes a scoring unit 103, located after the de-redundancy unit 102 and before the obtaining unit 104. The scoring unit 103 is configured to score each group of de-redundant multi-source heterogeneous data according to importance, and to upload the scores together with the corresponding data storage locations to the blockchain. A first score is added each time a piece of multi-source heterogeneous data appears, or each time one of its keywords appears; a second score is added if the data is bookmarked.
Through the scoring unit 103, the de-redundant data can be scored by importance, and the scores and data storage locations uploaded to the blockchain. The importance of data can be observed during de-redundancy: data that are repeated more often are more important, data involving frequently occurring keywords are more important, and bookmarked data are more important. The importance score is an integer; for each piece of data, the number of times the data appears, the number of times its keywords appear, and whether it has been bookmarked are counted. Each appearance of the data adds the first score to its importance, each appearance of one of its keywords adds the first score, and being bookmarked adds the second score. The sum of all these contributions is the final score of the piece of data. Uploading the scores and storage locations to the blockchain makes the data easy to trace.
In the scoring unit 103, the first score is one point and the second score is five points.
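With these values, the importance score of a piece of data reduces to a simple sum; the function and its counter arguments are an illustrative sketch, not part of the claimed apparatus:

```python
def importance_score(data_occurrences, keyword_occurrences, bookmarked,
                     first_score=1, second_score=5):
    """First score per appearance of the data, first score per
    appearance of one of its keywords, second score if the data
    is bookmarked; the final score is the sum of all contributions."""
    score = data_occurrences * first_score
    score += keyword_occurrences * first_score
    if bookmarked:
        score += second_score
    return score

# Data seen 3 times, whose keywords appeared 4 times, and bookmarked:
assert importance_score(3, 4, True) == 3 + 4 + 5  # = 12
```

As described for the obtaining unit 104, a record whose score is N is then trained locally for N rounds, so more important data contributes more to training.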
The data scored by the scoring unit 103 are sent to the obtaining unit 104, and the corresponding text information is obtained through the model in the obtaining unit 104.
In the obtaining unit 104, each piece of data has its own importance score; a piece of data whose importance score is N is trained locally for N rounds.
The solution of the present invention proposes a multi-source heterogeneous data preprocessing method: by grouping multi-source heterogeneous data based on data category, obtaining the text information corresponding to the data, and obtaining the word vectors corresponding to that text information to form word vector sets, it solves the problem that multiple types of data are difficult to fuse and analyze; by fusing each word vector set into a one-dimensional vector and jointly building a model through federated learning based on the one-dimensional vectors, it can simplify the workflow of government personnel and improve the quality and efficiency of their work. Uploading the scores and data storage locations to the blockchain ensures trusted traceability of the data and keeps government data secure and tamper-proof. Jointly building the government model through federated learning protects the privacy of key government data.
Blockchain is a term from the field of information technology. In essence, it is a shared database in which the stored data or information is unforgeable, leaves a complete audit trail, and is traceable, open and transparent, and collectively maintained.
Federated learning, also known as federated machine learning, joint learning, or alliance learning, is a machine learning framework that can effectively help multiple institutions use data and build machine learning models while meeting the requirements of user privacy protection, data security, and government regulations.
According to an embodiment of the present invention, a computer device corresponding to the multi-source heterogeneous data preprocessing apparatus is also provided, the computer device comprising the multi-source heterogeneous data preprocessing apparatus described above.
Since the processing and functions implemented by the computer device of this embodiment substantially correspond to the embodiments, principles, and examples of the foregoing apparatus, for anything not described in detail here, reference may be made to the relevant descriptions in the foregoing embodiments, which will not be repeated.
By adopting the technical solution of the present invention, multi-source heterogeneous data is grouped based on data category, the text information corresponding to the data is obtained, and the word vectors corresponding to that text information are obtained to form word vector sets, solving the problem that multiple types of data are difficult to fuse and analyze. By fusing each word vector set into a one-dimensional vector and jointly building a model through federated learning based on the one-dimensional vectors, the workflow of government personnel can be simplified and the quality and efficiency of their work improved. Uploading the scores and data storage locations to the blockchain ensures trusted traceability of the data and keeps government data secure and tamper-proof. Jointly building the government model through federated learning protects the privacy of key government data.
According to an embodiment of the present invention, a storage medium corresponding to the multi-source heterogeneous data preprocessing method is also provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the multi-source heterogeneous data preprocessing method described above.
Since the processing and functions implemented by the storage medium of this embodiment substantially correspond to the embodiments, principles, and examples of the foregoing method, for anything not described in detail here, reference may be made to the relevant descriptions in the foregoing embodiments, which will not be repeated.
By adopting the technical solution of the present invention, multi-source heterogeneous data is grouped based on data category, the text information corresponding to the data is obtained, and the word vectors corresponding to that text information are obtained to form word vector sets, solving the problem that multiple types of data are difficult to fuse and analyze. By fusing each word vector set into a one-dimensional vector and jointly building a model through federated learning based on the one-dimensional vectors, the workflow of government personnel can be simplified and the quality and efficiency of their work improved. Uploading the scores and data storage locations to the blockchain ensures trusted traceability of the data and keeps government data secure and tamper-proof. Jointly building the government model through federated learning protects the privacy of key government data.
In summary, those skilled in the art will readily understand that, provided there is no conflict, the above advantageous features may be freely combined and superimposed.
The above descriptions are merely embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111494889.XA CN114416865A (en) | 2021-12-08 | 2021-12-08 | Multi-source heterogeneous data preprocessing method and device, computer equipment and storage medium |
| Publication Number | Publication Date |
|---|---|
| CN114416865A (en) | 2022-04-29 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111494889.XA (Pending) CN114416865A (en) | Multi-source heterogeneous data preprocessing method and device, computer equipment and storage medium | 2021-12-08 | 2021-12-08 |
| Country | Link |
|---|---|
| CN (1) | CN114416865A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2006338342A (en)* | 2005-06-02 | 2006-12-14 | Nippon Telegr & Teleph Corp <Ntt> | Word vector generation device, word vector generation method and program |
| WO2017193497A1 (en)* | 2016-05-09 | 2017-11-16 | 包磊 | Fusion model-based intellectualized health management server and system, and control method therefor |
| CN107798137A (en)* | 2017-11-23 | 2018-03-13 | 霍尔果斯智融未来信息科技有限公司 | A kind of multi-source heterogeneous data fusion architecture system based on additive models |
| CN108595590A (en)* | 2018-04-19 | 2018-09-28 | 中国科学院电子学研究所苏州研究院 | A kind of Chinese Text Categorization based on fusion attention model |
| CN110196923A (en)* | 2019-05-07 | 2019-09-03 | 中国科学院声学研究所 | A kind of multi-source heterogeneous data preprocessing method and system towards undersea detection |
| CN110390023A (en)* | 2019-07-02 | 2019-10-29 | 安徽继远软件有限公司 | A knowledge map construction method based on the improved BERT model |
| CN110489395A (en)* | 2019-07-27 | 2019-11-22 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Automatically the method for multi-source heterogeneous data knowledge is obtained |
| CN111651597A (en)* | 2020-05-27 | 2020-09-11 | 福建博思软件股份有限公司 | Multi-source heterogeneous commodity information classification method based on Doc2Vec and convolutional neural network |
| CN111897875A (en)* | 2020-07-31 | 2020-11-06 | 平安科技(深圳)有限公司 | Fusion processing method, device and computer equipment for urban multi-source heterogeneous data |
| US11170022B1 (en)* | 2020-06-03 | 2021-11-09 | Shanghai Icekredit, Inc. | Method and device for processing multi-source heterogeneous data |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||