Background
Secure multi-party computation addresses the problem of carrying out a joint computation safely among multiple parties when no trusted third party exists. It allows multiple data owners to compute collaboratively over their private data and extract the value of that data without revealing any owner's original data. With the rapid development of emerging technologies such as cloud computing and artificial intelligence, and with growing demands for data privacy and security protection, multi-party computation plays an increasingly important role in many fields.
Homomorphic encryption makes it possible to process encrypted data: beyond basic encryption, it supports various computations directly on ciphertexts, so that computing before decryption is equivalent to computing after decryption. In other words, a third party can process the encrypted data without learning anything about the original content, while the user holding the key can decrypt the processed data to obtain the processed result.
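As a toy illustration of the "compute on ciphertexts, then decrypt to get the computed result" idea, unpadded ("textbook") RSA is multiplicatively homomorphic: multiplying two ciphertexts yields a ciphertext of the product of the plaintexts. The tiny parameters below are for illustration only; textbook RSA with small primes is completely insecure and is not the encryption scheme of this disclosure.

```python
# Textbook RSA is multiplicatively homomorphic:
# E(m1) * E(m2) mod n is an encryption of m1 * m2 mod n.
# WARNING: toy, insecure parameters -- illustration only.

p, q = 61, 53
n = p * q                      # RSA modulus (3233)
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent (modular inverse)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

m1, m2 = 7, 11
c1, c2 = encrypt(m1), encrypt(m2)

# Multiply the ciphertexts, then decrypt: the result equals m1 * m2.
product_cipher = (c1 * c2) % n
assert decrypt(product_cipher) == (m1 * m2) % n   # 77
```

The same principle, with addition instead of multiplication, underlies the additively homomorphic schemes commonly used for secure aggregation.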
At present, common multi-node joint model training methods fall into two types: collaborative machine learning and federated learning.
In collaborative machine learning, data are trained on different nodes separately to build a joint model. The main workflow is as follows: a participating user first downloads the current prediction model, then trains and improves it with local training data, and uploads the improved model parameters to a master node over a secure encrypted channel; the master node automatically merges them into the latest model. Machine learning in this collaborative mode avoids training on a single large centralized data set, which would require a low-latency, high-throughput environment for high-intensity iterations. The collaborative setting, however, is very different: data are distributed across thousands of mobile terminals of varying specifications, and these terminals suffer high network latency, low network throughput, and even intermittent connectivity, so continuous availability cannot be guaranteed.
The other is federated learning. In existing federated learning, feature processing is performed on each client separately and no raw-data exchange is allowed among clients, so feature processing for the federated model cannot see the overall picture of the data or exploit the complete data characteristics. For model evaluation, each participant trains a model on its local training data and uses test data to assess the model's generalization ability; but different data-set splits yield different models, i.e. model performance is sensitive to how the data set is partitioned. For parameter tuning, the prior art fixes one hyper-parameter combination of the model, trains a federated model, then manually substitutes another hyper-parameter combination and trains again, and finally compares the model results under the different parameters to obtain the optimal combination. In other words, federated learning requires many manual runs, making model optimization difficult and inefficient.
Whether collaborative machine learning or federated learning, data computation is carried out on each node and the data are never centralized, so the overall picture of the data cannot be known. Second, both techniques train models on plaintext data and provide no effective security mechanism, so they cannot prevent the serious data leakage that plaintext training entails. Third, because neither technique trains on the full data but only on local data, both also suffer from poor model accuracy.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a secure cross-domain model training method and system based on multi-party participation.
The application is realized by the following technical scheme:
A secure cross-domain model training method based on multi-party participation comprises the following steps: first, data preprocessing is performed on the raw data of the participating nodes; homomorphic encryption is then applied to the preprocessed data; the ciphertext data is transmitted over the network by the cooperative communication module; the cooperative communication module of the master node receives the communication data from the participating nodes and processes it; joint model computation is performed on the ciphertext data together with the master node's data, and model optimization is carried out on the computation result; and the model-optimization parameter information is sent to a message server, which sends the model-optimization parameters to all participating nodes.
The method and device make full use of the characteristics of multi-party computation and the homomorphism of homomorphic encryption: the raw data of the participating nodes is encrypted into ciphertext through multi-party computation and used in joint model computation with the plaintext data of the master node, and the model-optimization parameters are broadcast to all participating nodes through the message server, so that the master node's joint model is continuously optimized and iterated. Throughout the joint model computation, the participating nodes contribute only ciphertext data, which ensures data security.
Compared with the prior art, the method has the following beneficial effects:
in the whole joint model training process, the homomorphic encryption property is fully utilized: the raw data of the participating nodes is first homomorphically encrypted, and the encrypted data of the participating nodes is then jointly trained with the plaintext data of the master node; throughout the joint training, the participating nodes' data remains homomorphically encrypted ciphertext, which ensures the data security of the participating nodes;
the method exploits the characteristics of multi-party computation and the homomorphism of homomorphic encryption: the raw data of the participating nodes is encrypted into ciphertext via multi-party computation, joint model computation is performed on the ciphertext together with the master node's plaintext data, and the model-optimization parameters are broadcast to all participating nodes through the message server, thereby achieving continuous optimization and iteration of the master node's joint model and improving the accuracy of the model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments. It is to be understood that the described embodiments are only a few embodiments of the present invention, and not all embodiments.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In addition, the embodiments of the present invention and the features of the embodiments may be combined with each other without conflict. As for the system embodiment, since it is basically similar to the method embodiment, its description is brief; for the relevant points, reference may be made to the corresponding description of the method embodiment.
As shown in fig. 1 to fig. 8, the method for training a secure cross-domain model based on multi-party participation disclosed in this embodiment includes the following steps:
Step S101, each participating node preprocesses its raw data according to the requirements of the master node's joint model training; this specifically comprises the following steps:
Step S10101, the raw data of the participating node is input into a data analyzer, which analyzes the data and classifies it into three categories: structured data, semi-structured data and unstructured data;
Step S10102, the analyzed data is input into a data converter, which converts it according to the requirements of the joint model computation.
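The preprocessing of step S101 can be sketched as follows. This is a minimal illustrative sketch: the heuristics, function names, and the flat-dict output format are assumptions for the example, not the exact analyzer and converter of this disclosure.

```python
# Sketch of step S101: a hypothetical "data analyzer" classifies a raw record
# as structured, semi-structured, or unstructured (step S10101), and a "data
# converter" normalizes it into a flat dict for joint-model computation
# (step S10102). All heuristics here are illustrative assumptions.
import csv
import io
import json

def analyze(raw: str) -> str:
    """Classify a raw record into one of the three categories of S10101."""
    try:
        json.loads(raw)
        return "semi-structured"       # e.g. JSON documents
    except ValueError:
        pass
    if "," in raw and "\n" in raw:
        return "structured"            # crude CSV heuristic
    return "unstructured"              # free text

def convert(raw: str, category: str) -> dict:
    """Convert analyzed data into a uniform dict, as in S10102."""
    if category == "semi-structured":
        return json.loads(raw)
    if category == "structured":
        rows = list(csv.DictReader(io.StringIO(raw)))
        return rows[0] if rows else {}
    return {"text": raw}

record = '{"age": 42, "income": 30000}'
category = analyze(record)             # "semi-structured"
features = convert(record, category)   # {"age": 42, "income": 30000}
```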
Step S102, homomorphic encryption is performed on the preprocessed data to form ciphertext data. This step exploits the homomorphism of homomorphic encryption. Assume the encryption function and decryption function of an encryption system are E and D respectively, where M and C are the plaintext space and the ciphertext space; let ⊙_M and ⊙_C be algebraic or arithmetic operations defined on the plaintext space and the ciphertext space respectively. The homomorphism of the encryption scheme is defined as follows: given arbitrary m₁, m₂ ∈ M, if the encryption and decryption functions of the encryption system satisfy the algebraic relation E(m₁) ⊙_C E(m₂) = E(m₁ ⊙_M m₂), or D(E(m₁) ⊙_C E(m₂)) = m₁ ⊙_M m₂, the encryption system is said to be homomorphic.
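This homomorphism relation can be checked concretely with a toy Paillier cryptosystem, in which the ciphertext-space operation is multiplication modulo n² and the corresponding plaintext-space operation is addition modulo n. The tiny primes below are for illustration only; a real deployment would use a vetted library and key sizes of 2048 bits or more, and the specific scheme is an assumption of this sketch, not mandated by the disclosure.

```python
# Toy Paillier cryptosystem: multiplying ciphertexts mod n^2 corresponds to
# adding plaintexts mod n, i.e. D(E(m1) * E(m2) mod n^2) = (m1 + m2) mod n.
# WARNING: tiny, insecure parameters -- for illustration only.
import math
import random

def keygen(p, q):
    n = p * q
    g = n + 1                                  # standard simple generator
    lam = math.lcm(p - 1, q - 1)
    # mu = (L(g^lam mod n^2))^-1 mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)                 # fresh randomness per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pk, sk = keygen(17, 19)                        # n = 323
m1, m2 = 12, 25
c1, c2 = encrypt(pk, m1), encrypt(pk, m2)
c_sum = (c1 * c2) % (pk[0] ** 2)               # operation in ciphertext space
assert decrypt(pk, sk, c_sum) == (m1 + m2) % pk[0]   # 37
```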
Step S103, the cooperative communication module of each participating node prepares for the multi-party computation: the ciphertext data is first loaded into the cooperative communication module, and the computation information and addresses are then loaded.
And step S104, carrying out network transmission on the ciphertext data.
And step S105, the cooperative communication module of the main control node receives the ciphertext data of the participating node.
Step S106, the master node performs data processing on the received ciphertext data of the participating nodes before the joint modeling computation. The specific steps are as follows: after receiving the ciphertext data of the participating nodes, the data processing module first parses the ciphertext data according to the protocol format, and then performs the data conversion required by the joint model computation.
Step S107, joint model training is performed using the ciphertext data of the participating nodes and the plaintext data of the master node, and a training result is obtained.
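One way the master node can combine the participants' ciphertext data with its own plaintext data, sketched here under the assumption of an additively homomorphic scheme such as Paillier, is to encrypt its own value under the participants' public key and aggregate everything homomorphically; only the key holder can decrypt the joint result. The scheme and the aggregation step are illustrative assumptions, not the exact protocol of this disclosure.

```python
# Sketch: joint computation on participants' Paillier ciphertexts and the
# master node's plaintext. The master encrypts its own value under the same
# public key and aggregates homomorphically, never seeing the participants'
# plaintexts. Toy, insecure parameters -- illustration only.
import math
import random

def keygen(p, q):
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)
    return n, (lam, mu)

def encrypt(n, m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(n, sk, c):
    lam, mu = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

n, sk = keygen(17, 19)                 # key pair held by the participants
participant_ciphertexts = [encrypt(n, 12), encrypt(n, 25)]  # ciphertext data
master_plaintext = 30                  # master node's own (plaintext) data

# Master node: aggregate without ever decrypting the participants' data.
aggregate = encrypt(n, master_plaintext)
for c in participant_ciphertexts:
    aggregate = (aggregate * c) % (n * n)   # homomorphic addition

# Only the key holder can decrypt the joint result: 12 + 25 + 30 = 67.
assert decrypt(n, sk, aggregate) == 67
```

In a real system this aggregation would be one inner step of joint model training (e.g. summing encrypted statistics or gradients) rather than the whole training procedure.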
Step S108, model optimization is performed according to the computation result of the joint model;
in the model-optimization process, the model training result is first evaluated: the quality of the trained model is assessed against preset evaluation parameters for each kind of model, and the parameters of the best-performing model are then obtained.
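The evaluation-and-selection step of S108 can be sketched as follows. The metric name and the candidate hyper-parameters are illustrative assumptions; the disclosure does not fix a particular evaluation parameter.

```python
# Sketch of step S108: candidate training results are scored against a preset
# evaluation metric (here a hypothetical "accuracy"), and the parameters of
# the best-scoring model are kept for broadcast to the participating nodes.

def select_best(model_results):
    """Return the parameters of the best-scoring model."""
    best = max(model_results, key=lambda r: r["accuracy"])
    return best["params"]

candidates = [
    {"params": {"lr": 0.1,  "epochs": 10}, "accuracy": 0.81},
    {"params": {"lr": 0.01, "epochs": 20}, "accuracy": 0.88},
    {"params": {"lr": 0.5,  "epochs": 5},  "accuracy": 0.73},
]
best_params = select_best(candidates)   # {"lr": 0.01, "epochs": 20}
```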
Step S109, sending the parameter information of model optimization to a message server, and sending the model optimization parameters to all the participating nodes by the message server;
Based on the above cross-domain model training method, the present application also discloses a secure cross-domain joint model training system that can be used to implement the method. The cross-domain joint model training system comprises:
the data preprocessing module is used for preprocessing the original data of the participating nodes;
the homomorphic encryption module is used for homomorphic encryption of the preprocessed data;
the cooperative communication module of the participating node is used for transmitting the ciphertext data of the participating node to the master node;
the cooperative communication module of the master node is used for receiving the ciphertext data of the participating nodes;
the data processing module is used for processing the received ciphertext data;
the joint model training module is used for performing joint model training using the ciphertext data of the participating nodes and the plaintext data of the master node;
the model optimization module is used for optimizing the model;
and the message server is used for sending the model optimization parameters to all the participating nodes.
The message server first processes the messages sent by the model optimization module, classifies them by message type, and then sends them to the participating nodes according to the message requirements.
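The classify-then-dispatch behavior of the message server can be sketched as below. The subscription model and the in-memory inboxes are illustrative assumptions; a real deployment would typically use a message broker over the network.

```python
# Sketch of the message server: messages from the model-optimization module
# are classified by type and broadcast to every participating node that
# subscribed to that type. In-memory lists stand in for node inboxes.
from collections import defaultdict

class MessageServer:
    def __init__(self):
        self.subscribers = defaultdict(list)   # message type -> node inboxes

    def subscribe(self, msg_type, node_inbox):
        self.subscribers[msg_type].append(node_inbox)

    def publish(self, message):
        msg_type = message["type"]             # classify by message type
        for inbox in self.subscribers[msg_type]:
            inbox.append(message)              # broadcast to each subscriber

node_a, node_b = [], []
server = MessageServer()
server.subscribe("model_params", node_a)
server.subscribe("model_params", node_b)
server.publish({"type": "model_params", "payload": {"lr": 0.01}})
# Both participating nodes now hold the model-optimization parameters.
```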
By homomorphically encrypting the multi-party data that participates in joint modeling, the present application ensures the security of the data involved in the computation; joint modeling of multi-party ciphertext data with the master node's plaintext data is realized on the basis of the homomorphic property, achieving secure joint model training while guaranteeing the security and privacy of each party's data.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The above embodiments are provided to explain the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above embodiments are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.