Disclosure of Invention
The invention relates to a medical data joint learning system and method based on trusted computing and privacy protection. It provides a complete service architecture for secure sharing, trusted computing, deep mining, authority authentication, and multi-platform joint learning of medical big data, and addresses the currently fragmented, isolated, and incomplete state of medical data privacy protection and data mining.
In order to achieve the above object, one technical solution of the present invention is to provide a medical data joint learning method based on trusted computing and privacy protection:
the joint learning central control layer receives and stores non-sensitive meta-information uploaded by the data contributor through the data contributor management layer of the data node where that contributor is located; the meta-information is derived from the contributor's original data and does not contain any sensitive information of the original data;
the joint learning central control layer receives and processes a joint learning request initiated by a data miner through the data miner interaction layer; in a secure computation area of the joint learning central control layer, the intermediate results obtained by local isolation computation on the original data of each data node are aggregated and analyzed, and the joint learning result is returned to the data miner interaction layer.
Optionally, the joint learning central control layer is provided with a central node server and a secure computation server, and interacts with the data node servers respectively arranged at the data contributor management layers of the data nodes;
the medical data joint learning method comprises the following processes:
step one, all original data of the data contributor is registered and stored within the local firewall; the data contributor accesses the data node server through the first interactive system, registers the data set, and specifies its access authority and validity period; all original data is stored in a local private database located within the firewall; the data node server sends the meta-information to the central node server for recording;
step two, the data miner accesses the central node server through the second interactive system and, after completing user registration and verification, searches the available data sets based on his or her own authority and creates a joint learning instance;
step three, the data miner sends a joint learning request to the central node server;
step four, based on the data set selected by the data miner, the central node server sends local computation requests to all data nodes involved in the current joint learning request;
step five, each data node that receives a local computation request performs local isolation computation on the original data within its firewall through its data node server and exchanges intermediate results with the secure computation server; the intermediate results contain no raw data;
step six, the secure computation server collects and updates the intermediate results obtained by the local isolation computation of all the data nodes, generates and outputs the joint learning result, and returns it to the central node server;
step seven, the central node server generates a joint learning report so that the data miner can obtain and use the joint learning result.
Optionally, the data contributor sets the access rights during data registration through the data contributor management layer of the data node where the contributor is located;
the access rights specify one or more of the time, place, data miner, and joint learning task for which use of the data is permitted.
Optionally, the data miner selects data whose access authority is public, and/or data that a data contributor has designated for that data miner, to perform joint learning;
the data miner may set his or her own joint learning instances as private or public, and other data miners are allowed to query and study the public joint learning instances.
Optionally, the meta-information and the intermediate results of each data node are uploaded to the joint learning central control layer in an encrypted state.
Optionally, before uploading the meta-information to the central node server, the data node may initiate remote enclave attestation, based on the Intel Software Guard Extensions (SGX) service, toward the central node server;
and the secure computation server uses the Intel SGX service to collect and analyze the intermediate results uploaded by each data node.
Optionally, the meta-information includes the network protocol (IP) address and port of the data node server, and the file name, description, and supported research methods of the original data; the intermediate results do not involve sensitive information of the original data; the intermediate results comprise intermediate training models and statistical parameters.
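By way of illustration only, such meta-information might be organized as in the following sketch (written in Python for brevity; the field names are hypothetical and are not prescribed by the invention):

# Illustrative record of the non-sensitive meta-information a data node registers
# with the central node server; no raw patient data appears here.
dataset_meta_information = {
    "node_ip": "10.0.12.7",              # network protocol (IP) address of the data node server
    "node_port": 8443,                   # port of the data node server
    "dataset_file_name": "cohort_2021.csv",
    "dataset_description": "De-identified description of the local data set",
    "supported_methods": [               # research methods the data set supports
        "chi_square_test",
        "proportional_hazards_regression",
        "anova",
        "kolmogorov_smirnov_test",
    ],
}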
The medical data joint learning system based on trusted computing and privacy protection can implement any one of the above medical data joint learning methods based on trusted computing and privacy protection.
The medical data joint learning system comprises:
the data node servers are arranged on the data contributor management layer of each data node;
the central node server and the secure computation server are arranged at the joint learning central control layer and interact with each data node server;
the data node server registers the local data set and assigns access authority, uploads meta-information to the central node server for recording, receives local computation requests from the central node server, performs local isolation computation on the locally stored original data, and sends the intermediate results to the secure computation server for aggregation;
the central node server receives the joint learning request initiated by the data miner, informs the secure computation server of the joint learning instance created by the data miner, sends local computation requests to the data nodes involved in the current joint learning request, waits for and receives the joint learning result collected and aggregated by the secure computation server from the corresponding data nodes, generates a joint learning report, and returns it to the data miner.
Optionally, the data node server implements its management framework using Spring + Vue and implements local isolation computation in C++;
the central node server implements its control architecture using Spring Boot + Vue and, using Docker technology, is deployed on a hardware platform on which Docker Compose is installed;
the secure computation server uses C++/Rust in combination with the Intel SGX service.
Optionally, the data node server, the local private database, and the first web-based interaction system configured at the data contributor management layer are located within the local firewall of the data node;
based on the first web-based interaction system, the data contributor accesses the data node server through a browser;
and the data miner interaction layer is provided with a second web-based interaction system, through which the data miner accesses the central node server via a browser.
Compared with the prior art, the medical data joint learning system and method based on trusted computing and privacy protection have the following advantages:
the scheme of the invention is implemented by a central node server of a central control layer of the joint learning, a safe computing server (a trusted computing area) and a plurality of data node servers of a data contributor management layer based on the joint learning. All the storage related to the original medical data is performed in a local isolation mode on the data nodes, and privacy disclosure is avoided fundamentally. The present invention enables strict and flexible authorization authentication of data sets, including but not limited to task, user, time and location based authorization. The central node stores the non-sensitive meta-information of the data set, and deep mining of the medical data is achieved by using a series of joint learning algorithms. Meanwhile, the central node joint learning core program uses Intel SGX software protection extension service, and the safety of the calculated data and results in an untrusted environment is ensured.
Detailed Description
The principles, features, and system flow of the present invention are described below in conjunction with the drawings, which are provided by way of example only and are not intended to limit the scope of the invention.
As shown in fig. 1, the medical data joint learning scheme based on trusted computing and privacy protection includes three major parts:
First, a data contributor management layer;
The local management layer enables localized registration, storage, and computation of all raw medical data by the data contributors (e.g., owners of medical big data such as hospitals and medical research institutions). Specifically, all of the raw data of the data contributors is registered and stored entirely locally (within the firewall). At the same time, all computations involving the raw data are likewise confined to local isolation. This design fundamentally prevents the external leakage of private data.
The local management layer uploads only the meta-information of the original data, such as the network protocol (IP) address and port of the local server and the file name, description, and supported research methods of the original data, to the central node server of the joint learning central control layer. Meanwhile, during local isolation computation, only the intermediate results (such as intermediate training models and statistical parameters) are transmitted to the secure computation area of the joint learning central control layer for secure aggregation.
The intermediate data does not involve any private information. For example, in an analysis of variance (ANOVA) test, each local server returns only the mean and sample size of its local data set; the central node server computes the overall mean and total sample size from these values and returns them to the local servers. Each local server then computes the sum of squared differences between its local values and the overall mean and returns it to the central node server, which combines the received values to compute the F statistic, from which the p-value of the test is obtained under the F distribution.
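The two rounds of this exchange can be sketched as follows (a minimal illustration in Python, assuming for simplicity that each data node holds one treatment group and ignoring the encryption of the transmitted values; function names are illustrative):

import numpy as np
from scipy.stats import f as f_dist

def local_round_1(x):
    # Executed inside the data node's firewall: return only mean and count.
    return float(np.mean(x)), len(x)

def local_round_2(x, grand_mean):
    # Executed locally: sum of squared deviations from the overall mean.
    return float(np.sum((np.asarray(x) - grand_mean) ** 2))

def central_anova(node_data):
    # node_data: list of local arrays; in the real system the central side only
    # ever sees the summaries returned by the two local rounds above.
    summaries = [local_round_1(x) for x in node_data]                 # round 1 uploads
    n_total = sum(n for _, n in summaries)
    grand_mean = sum(m * n for m, n in summaries) / n_total           # overall mean
    ss_total = sum(local_round_2(x, grand_mean) for x in node_data)   # round 2 uploads
    ss_between = sum(n * (m - grand_mean) ** 2 for m, n in summaries)
    ss_within = ss_total - ss_between
    k = len(summaries)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    p_value = f_dist.sf(f_stat, k - 1, n_total - k)                   # p-value from the F distribution
    return f_stat, p_value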
It is emphasized here that the intermediate results of the computations are transmitted, stored, and processed entirely in an encrypted state. Even if the central node server is hijacked, the state and data of the computation cannot be disclosed.
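As a simple illustration of the idea of encrypting an intermediate result before it leaves the node (this sketch uses an ordinary symmetric cipher and a session key assumed to have been established beforehand; it does not depict the SGX-based mechanism actually used by the invention):

import json
from cryptography.fernet import Fernet   # requires the "cryptography" package

# Assume this key was negotiated with the trusted computing area in advance.
session_key = Fernet.generate_key()
channel = Fernet(session_key)

intermediate_result = {"local_mean": 4.2, "sample_size": 118}   # summaries only, no raw records
ciphertext = channel.encrypt(json.dumps(intermediate_result).encode())

# Only the holder of the session key (the trusted computing area) can recover the values.
recovered = json.loads(channel.decrypt(ciphertext).decode())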
In the data registration process, the invention provides a strict and flexible access authority control mechanism, such as authorization based on the joint learning task, authorization based on the validity period of the data set, authorization based on specified data miners, authorization based on geographic location or research institution, and so forth. Specifically, data contributors can specify who may use the data sets they provide, at what times and at what locations, to conduct joint learning studies with the specified methods.
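A minimal sketch of such an authorization check, with hypothetical field names (the actual mechanism is enforced by the data node server), might be:

from datetime import datetime

def access_allowed(policy, request):
    # policy: set by the data contributor at registration time (datetimes and sets).
    # request: describes who wants to run what, where and when.
    now = request.get("time", datetime.utcnow())
    if not (policy["valid_from"] <= now <= policy["valid_until"]):
        return False                                   # data set validity period
    if policy["allowed_miners"] and request["miner_id"] not in policy["allowed_miners"]:
        return False                                   # specified data miners only
    if policy["allowed_tasks"] and request["task"] not in policy["allowed_tasks"]:
        return False                                   # specified joint learning tasks only
    if policy["allowed_institutions"] and request["institution"] not in policy["allowed_institutions"]:
        return False                                   # geographic location / research institution
    return True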
Before uploading the meta-information to the central node server, the local server initiates remote enclave attestation, based on the Intel SGX trusted computing unit, toward the central node server to verify that the trusted computing unit of the central node server has been registered as trusted with the Intel attestation server. This ensures the privacy and security of the meta-information and the intermediate computation results during transmission, storage, and computation.
Second, a joint learning central control layer;
The central node server is responsible for the data registration of data contributors, meta-information storage (without involving any raw data), and the processing of data miners' joint learning requests. The secure computation server uses Intel Software Guard Extensions (SGX) to collect and analyze the intermediate results of the local computations in the cloud; the final results are transmitted back to the data miner interaction layer, and a joint learning result report is generated at the browser end.
The encrypted intermediate results uploaded by each medical data node are loaded into the core program of the central node server and aggregated within the trusted computing area to obtain the final learning result. The core program of the invention uses the SGX service provided by Intel; all operations are performed encrypted within the trusted computing area, which greatly improves the security of program execution and provides confidentiality, integrity, and availability of code and data. Specifically, the core program trusts only its own CPU and Intel, so attacks on the core program remain ineffective even if the underlying operating system (OS) is compromised, and, administratively, the cloud service provider need not be trusted.
Third, a data miner interaction layer;
The data miner interaction layer is provided with a web-based interaction system. The data miner accesses the joint learning interaction system through a browser to complete user registration; after verification, the data miner can select data whose access authority is public, or data that a data contributor has designated for that miner, and perform joint learning with different algorithms, such as the chi-square test, proportional hazards regression, analysis of variance, and the Kolmogorov-Smirnov test. Meanwhile, the data miner can choose to set his or her own joint learning instances as public or private; public joint learning instances may also be queried and studied by other data miners.
The invention uses a joint learning (federated learning) model to realize the secure sharing and deep mining of medical data. As shown in fig. 1, the joint learning model performs local computation on the medical data contributors' own servers and uploads only encrypted intermediate results (statistical information, intermediate training models, etc.) to the central node server for secure aggregation; all training data (original data) remains on the original devices.
That is, the data contributors retain ownership of the data, the original data remains local, and the objects of search or analysis may all be encrypted data. The data miner can perform encrypted retrieval so that the privacy of the search target is protected; the data contributors can choose to lease data and adjust prices according to market demand; if the retrieval results match, the data miner can lease the corresponding data to perform joint learning analysis, and the encrypted analysis parameters and joint learning results can be extracted and viewed only by that data miner. Data contributors may deregister registered data at any time; once the registration is revoked, the encryption key is destroyed and the data miner can no longer use the data.
Illustratively, each data contributor of the invention is provided with a local data management interaction system, which comprises a data node server located within the local firewall, together with a local private database and a web-based interaction system that interact with the data node server. The data node server further interacts with the central node server and the secure computation server of the joint learning central control layer.
The data node server uses Spring + Vue to implement the management layer (architecture-oriented) and C++ to implement local isolation computation (speed-oriented). At the data contributor management layer, data contributors upload local data sets (fig. 7, fig. 8), specify access rights (e.g., restrictions based on time, place, personnel, or task), and register the data meta-information (fig. 9) with the central node server. To realize local isolation computation, the data node server receives local computation requests (fig. 4) from the central node server, performs the local isolation computation for the corresponding joint learning method, and sends the intermediate results to the secure computation server for aggregation.
Taking the proportional hazards regression model as an example, a first-derivative matrix DF and a second-derivative Hessian matrix DDF are obtained through local isolation computation and sent to the secure computation server; the secure computation server returns the not-yet-converged coefficient matrix, and the two parties repeat this exchange until the convergence condition is met.
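A minimal sketch of the aggregation step performed inside the secure computation area, assuming DF is each node's local gradient of the partial log-likelihood and DDF is the corresponding local observed information matrix (the negative Hessian), might be (Python, illustrative names):

import numpy as np

def aggregate_newton_step(beta, node_uploads, tolerance=1e-6):
    # node_uploads: list of (DF, DDF) pairs, one per data node, computed locally.
    total_gradient = sum(df for df, _ in node_uploads)
    total_information = sum(ddf for _, ddf in node_uploads)
    step = np.linalg.solve(total_information, total_gradient)   # Newton-Raphson step
    new_beta = beta + step
    converged = bool(np.max(np.abs(step)) < tolerance)
    return new_beta, converged   # sent back to the data nodes if not yet converged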
Note that, before communicating, the data node server and the secure computation server perform remote verification based on Intel enclave attestation techniques.
In this example, the central node server implements the joint learning central control layer using a Spring Boot + Vue architecture and, using Docker technology, can be rapidly deployed on any hardware platform on which Docker Compose is installed. The central node server is responsible for receiving the joint learning request of the data miner (fig. 2), informing the secure computation server of the joint learning instance (fig. 3), sending local isolation computation requests to the data node cluster involved in the joint learning (fig. 4), waiting for and receiving the joint learning result collected from the data node cluster and aggregated by the secure computation server (fig. 5), and generating a joint learning result report and returning it to the data miner (fig. 6).
In this example, the secure computation server (trusted computing area), implemented in C++/Rust in conjunction with the Intel SGX service, accepts the joint learning request from the central node server (fig. 3), aggregates the local isolation computation results from the data node cluster (taking the proportional hazards regression model as an example, the intermediate results include the not-yet-converged coefficient matrix, the first-derivative matrix DF, and the second-derivative Hessian matrix DDF), computes the final result, and sends it to the central node server (fig. 5).
The following is an example of a specific service flow of the present invention:
In the first step, the data contributor registers the data set (figs. 7 and 8) through the local data node server and specifies the access rights, validity period, and the like of the data set. All original data is stored in a local private database located within the firewall. Meanwhile, the data node server initiates enclave attestation toward the central node server, and after the secure encrypted computing environment is confirmed, the encrypted meta-information (fig. 9) is sent to the central node server for recording.
In the second step, the data miner completes user registration through the interaction system and, after verification, searches the available data sets based on his or her own authority and creates a joint learning instance.
In the third step, the data miner initiates a joint learning request to the central node server (fig. 2).
In the fourth step, the central node server sends local computation requests to all data nodes involved in this joint learning (based on the data set selected by the data miner) (fig. 4).
In the fifth step, each data node performs local isolation computation and exchanges intermediate results (which do not involve the original data) with the secure computation server.
For example, in joint learning of the proportional hazards regression model: (1) the local isolation computation calculates the first-derivative matrix DF and the second-derivative Hessian matrix DDF from the original data and sends them to the secure computation server; (2) the secure computation server calculates the not-yet-converged coefficient matrix and returns it to the data node server; and (3) operations (1) and (2) are repeated until the coefficient convergence condition is met. The data transmitted in this process involves only derivative matrices of the original data and the unconverged parameter matrix, and contains no original data.
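For illustration, the local computation in operation (1) can be sketched as follows (Python; a simplified Breslow-type computation of the partial-likelihood derivatives, with tied event times handled only approximately; performed entirely inside the data node's firewall, so that only DF and DDF ever leave the node, and then only in encrypted form):

import numpy as np

def local_cox_derivatives(X, time, event, beta):
    # X: (n, p) covariates; time: (n,) follow-up times; event: (n,) 1 = event observed.
    # Returns the local gradient DF and observed information matrix DDF of the
    # Cox partial log-likelihood at the current coefficient vector beta.
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    order = np.argsort(-time)                      # process subjects from longest follow-up
    X, time, event = X[order], time[order], event[order]
    w = np.exp(X @ beta)
    p = X.shape[1]
    DF, DDF = np.zeros(p), np.zeros((p, p))
    s0, s1, s2 = 0.0, np.zeros(p), np.zeros((p, p))
    for i in range(len(time)):
        # Risk-set accumulators: all subjects with follow-up >= time[i] processed so far.
        s0 += w[i]
        s1 += w[i] * X[i]
        s2 += w[i] * np.outer(X[i], X[i])
        if event[i] == 1:
            xbar = s1 / s0
            DF += X[i] - xbar                          # score contribution
            DDF += s2 / s0 - np.outer(xbar, xbar)      # information contribution
    return DF, DDF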
Meanwhile, all intermediate results (the derivative matrices, the unconverged parameter matrix, and the like) are transmitted in an encrypted state and are decrypted and processed only within the trusted computing area. Even if the cloud server hosting the secure computation server is hijacked by an attacker, the intermediate results cannot be leaked.
In the sixth step, the secure computation server collects and updates the local isolation computation results of all the data nodes, generates and outputs the final joint learning result (fig. 5), and returns it to the central node server.
In the seventh step, the central node server generates a joint learning report (fig. 6), and the data miner can query or print the joint learning result.
Figs. 2 to 9 take joint learning of the proportional hazards regression model as an example:
FIG. 2 is an example of the data format of a joint learning request submitted by a data miner via a browser. The joint learning request provides, for example, data attribute information of the joint learning method: a list of selected attribute parameters (including attribute name, whether the attribute is categorical, attribute values, etc.); data node information: the unique identifier of the data node (including the unique identifier of the data node's data set, the textual name of the data set, etc.) and the textual description of the data node; and joint learning instance information: name, whether it is public, start time, expected end time, remarks, the unique identifier of the user to whom the joint learning belongs, and the like.
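For illustration, such a request might be serialized as a structure like the following (the field names are hypothetical and merely mirror the items listed above):

joint_learning_request = {
    "method": "proportional_hazards_regression",
    "selected_attributes": [
        {"name": "age", "categorical": False, "values": None},
        {"name": "treatment_group", "categorical": True, "values": ["A", "B"]},
    ],
    "data_nodes": [
        {"node_id": "node-01", "dataset_id": "ds-2021-cohort",
         "dataset_name": "Cohort 2021", "description": "Hospital A survival cohort"},
    ],
    "instance": {
        "name": "treatment-survival-study",
        "public": False,
        "start_time": "2022-03-01T09:00:00Z",
        "expected_end_time": "2022-03-02T09:00:00Z",
        "remarks": "Exploratory analysis",
        "owner_id": "miner-007",
    },
}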
Fig. 3 is an example of the parameters with which the central node server notifies the secure computation server of a joint learning instance, including the unique identifier of the joint learning instance, the joint learning task attributes corresponding to each method, and a data node list (including the unique identifier of each data node, its network address and port, the current state of the joint learning, and the like).
Fig. 4 is an example of the data format of a local computation request sent by the central node server to a data node, which includes the file name of the local data set, a local isolation computation attribute list (including attribute values, attribute names, and whether each attribute is categorical), and the local unique identifier of the data set.
Fig. 5 is an example of the data format of the joint learning result collected by the secure computation server and sent to the central node server, which includes the joint learning data set attribute list, correlation coefficients, Z test values, P probability values, and the like.
FIG. 6 is an example of the joint learning report generated by the central node server and returned to the data miner, including a joint learning summary (joint learning name, creator, detailed description, disclosure rights, creation time, completion time, etc.); joint learning parameters (attribute names, the data nodes participating in the joint learning, etc.); and the joint learning results (attribute names, correlation parameters, P probability values, Z test values, etc.).
Fig. 7 is an example of the basic information of a data set stored by the data node server, which includes the data set itself (local database unique identifier, data set name, data set description, etc.), the methods supported by the data set (the specific supported methods, public authority, data set file name, authorized users, authorized institutions, authorization start/end time, etc.), and a data set summary (attribute list, data volume, number of attribute categories, category values, etc.).
Fig. 8 is an example of the raw data stored by the data node server, which includes the attribute list, whether each attribute is categorical, attribute values, and the like. Fig. 9 is an example of the data set meta-information registered by a data node server with the central node server, which includes a data set meta-information list containing the metadata of each data set (whether the attributes are categorical, the attribute list, the data set file name, the local database unique identifier, the supported joint learning methods, the data set name, the data set description, the category to which each attribute belongs, the validity start date, etc.); the data node name; the data node description; the data node access token; the data node network address and port; the data node user name; and the like.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.