CN111737753B


Info

Publication number: CN111737753B
Application number: CN202010722393.2A
Authority: CN
Inventors: 陈超超; 周俊; 王力; 郑龙飞
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2020-12-22
Anticipated expiration: 2040-07-24
Also published as: CN111737753A

Abstract

The embodiments of this specification provide a two-party data clustering method, apparatus, and system based on data privacy protection. At each data owner, each data sample in that data owner's data set is split into two data shares. Each data owner shares one of the two data shares of each data sample with the other data owner. At each data owner, a reorganized data set is obtained based on the data shares retained by that data owner and the data shares obtained from the other data owner. Data clustering is then performed between the data owners using their reorganized data sets.

Description

Two-party data clustering method, device and system based on data privacy protection

Technical Field

The embodiments of the present specification generally relate to the field of data clustering, and in particular, to a method, an apparatus, and a system for data clustering based on data privacy protection.

Background

Data clustering is a very common technique in machine learning. It is often applied to tasks such as community discovery, anomaly detection, and the like. Examples of data clustering algorithms may include the k-Means algorithm. The k-Means algorithm is an unsupervised learning algorithm that aims to classify similar objects into the same cluster. The more similar the objects within a cluster, the better the clustering.

When data clustering is performed, the center points must be disclosed during the clustering process, which may leak private data or private information held by a data owner.

Disclosure of Invention

In view of the foregoing, embodiments of the present specification provide a two-party data clustering method, apparatus, and system based on data privacy protection, which can perform data clustering while keeping the private data of each of the two data owners secure.

According to an aspect of an embodiment of the present specification, there is provided a two-party data clustering method based on data privacy protection, including: at each data owner, splitting each data sample in that data owner's data set into two data shares; sharing, by each data owner, one of the two data shares of each data sample with the other data owner; at each data owner, obtaining a reorganized data set of that data owner based on the data shares of the data samples retained by that data owner and the data shares of the data samples obtained from the other data owner; and performing data clustering between the data owners using the reorganized data sets of the data owners.

Optionally, in an example of the above aspect, the data set owned by each data owner is a horizontally sliced data set, and obtaining, at each data owner, a reorganized data set of that data owner based on the data shares of the data samples retained by that data owner and the data shares of the data samples obtained from the other data owner includes: at each data owner, horizontally splicing the data shares of the data samples retained by that data owner with the data shares of the data samples obtained from the other data owner to obtain the reorganized data set of that data owner.

Optionally, in an example of the above aspect, the data set owned by each data owner is a vertically sliced data set, and obtaining, at each data owner, a reorganized data set of that data owner based on the data shares of the data samples retained by that data owner and the data shares of the data samples obtained from the other data owner includes: at each data owner, vertically splicing the data shares of the data samples retained by that data owner with the data shares of the data samples obtained from the other data owner to obtain the reorganized data set of that data owner.

Optionally, in one example of the above aspect, performing data clustering between the data owners using the reorganized data sets of the data owners comprises executing the following process in a loop until the cluster category center points no longer change: determining, between the data owners, the sample distances between each data sample in the data sets owned by the data owners and each current cluster category center point, using the reorganized data sets of the data owners; re-clustering each data sample according to the determined sample distances between that data sample and each current cluster category center point; and updating each current cluster category center point according to the re-clustering result, wherein, when a cluster category center point changes, the updated cluster category center point is used as the current cluster category center point in the next loop iteration.

Optionally, in an example of the above aspect, the sample distances between each data sample in the data sets owned by the data owners and each current cluster category center point are determined between the data owners by performing secure multi-party computation using the reorganized data sets of the data owners, each data owner holding a distance share of each sample distance.

Optionally, in an example of the foregoing aspect, performing data clustering again on each data sample according to a sample distance between each determined data sample and each current clustering class center point includes: for each data sample in the data set owned by each data owner, the data sample is clustered again according to a comparison protocol based on secret sharing using the distance shares for each sample distance of the data sample that each data owner has.

Optionally, in an example of the above aspect, for each data sample in the data set owned by each data owner, re-clustering the data sample according to a comparison protocol based on secret sharing, using the distance shares that each data owner holds for each sample distance of the data sample, comprises: comparing the magnitudes of the sample distances of the data sample according to the comparison protocol based on secret sharing, using the distance shares that each data owner holds for each sample distance of the data sample; and, according to the comparison result, clustering the data sample into the cluster category whose current cluster category center point has the minimum sample distance.

Optionally, in an example of the above aspect, the sample distance size comparison result is a category vector, and the category vector is stored at each data owner in a secret sharing manner.

Optionally, in an example of the above aspect, updating each current cluster category center point according to the re-clustering result includes: updating each current cluster category center point using secure multi-party computation based on the category vector shares of each data sample held by each data owner.

According to another aspect of embodiments of the present specification, there is provided a two-party data clustering method based on data privacy protection, the method being applied to a data owner and including: splitting each data sample in the data set into two data shares; sharing one of the two data shares of each data sample with the other data owner; obtaining a reorganized data set of the data owner based on the data shares of the data samples retained by the data owner and the data shares of the data samples obtained from the other data owner; and performing data clustering between the data owners using the reorganized data sets of the data owners, wherein the reorganized data set of the other data owner is obtained based on the data shares of the data samples retained by the other data owner and the data shares of the data samples obtained from the data owner.

According to another aspect of embodiments of the present specification, there is provided a two-party data clustering apparatus based on data privacy protection, the two-party data clustering apparatus being applied to a data owner, the two-party data clustering apparatus including: the data segmentation unit is used for segmenting each data sample in the data set into two data shares; a share sharing unit that shares one of the two data shares of each of the cut data samples to the other data owner; the data set reorganization unit is used for obtaining the reorganized data set of the data owner based on the data shares of the data samples reserved by the data owner and the data shares of the data samples acquired from the other data owner; and a data clustering unit that performs data clustering between the data owners using the regrouped data sets of the data owners, wherein the regrouped data set of the other data owner is obtained based on the data shares of the data samples retained by the other data owner and the data shares of the data samples acquired from the data owner.

Optionally, in an example of the above aspect, the data set owned by each data owner is a horizontally sliced data set, and the data set reorganizing unit laterally concatenates data shares of each data sample retained by the data owner and data shares of each data sample acquired from another data owner to obtain a reorganized data set of the data owner.

Optionally, in an example of the above aspect, the data set owned by each data owner is a vertically-sliced data set, and the data set reorganizing unit longitudinally concatenates data shares of each data sample retained by the data owner and data shares of each data sample acquired from another data owner to obtain a reorganized data set of the data owner.

Optionally, in an example of the above aspect, the data clustering unit includes: the sample distance determining module is used for determining the sample distance between each data sample in the data set owned by each data owner and each current clustering class central point by using the recombined data set of each data owner among the data owners; the data clustering module is used for clustering data of each data sample again according to the sample distance between each determined data sample and each current clustering class central point; and the central point updating module updates each current clustering category central point according to the data clustering result again, wherein the sample distance determining module, the data clustering module and the central point updating module are executed in a circulating way until the clustering category central points are not changed any more, and when the clustering category central points are changed, the updated clustering category central points are used as the current clustering category central points in the next circulating process.

Optionally, in an example of the above aspect, the sample distance determination module performs secure multi-party computation between the data owners using the reorganized data sets of the data owners to determine the sample distances between each data sample in the data sets of the data owners and each current cluster category center point, each data owner holding a distance share of each sample distance.

Optionally, in an example of the above aspect, for each data sample in the data set owned by each data owner, the data clustering module performs data clustering again on the data sample according to a comparison protocol based on secret sharing, using distance shares that each data owner has for each sample distance of the data sample.

Optionally, in an example of the above aspect, for each data sample in the data set owned by each data owner, the data clustering module compares the size of each sample distance of the data sample according to a comparison protocol based on secret sharing, using a distance share that each data owner has for the respective sample distance; and according to the comparison result of the sample distance, clustering the data sample to the cluster class to which the current cluster class central point with the minimum sample distance belongs.

Optionally, in an example of the above aspect, the center point updating module updates each current cluster category center point using secure multi-party computation according to the category vector shares of each data sample held by each data owner.

According to another aspect of embodiments herein, there is provided a two-party data clustering system based on data privacy protection, the two-party data clustering system including: a first data owner comprising a two-party data clustering apparatus as described above; and a second data owner comprising a two-party data clustering apparatus as described above.

According to another aspect of embodiments of the present specification, there is provided an electronic apparatus including: at least one processor, and a memory coupled with the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the two-party data clustering method as described above.

According to another aspect of embodiments herein, there is provided a machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the two-party data clustering method as described above.

Drawings

A further understanding of the nature and advantages of the present disclosure may be realized by reference to the following drawings. In the drawings, similar components or features may have the same reference numerals.

FIG. 1 shows an example schematic of a data clustering process.

FIG. 2 illustrates a block diagram of a two-party data clustering system based on data privacy protection in accordance with an embodiment of the present description.

FIG. 3 illustrates a flow diagram of a method for two-party data clustering based on data privacy protection in accordance with an embodiment of the present description.

FIG. 4 illustrates a schematic diagram of an example of a horizontally sliced dataset according to an embodiment of the present description.

FIG. 5 illustrates an example schematic diagram of a data set reorganization process of a data owner with a horizontally sliced data set according to an embodiment of the present description.

FIG. 6 illustrates a schematic diagram of an example of vertically slicing a dataset according to an embodiment of the present description.

FIG. 7 illustrates an example schematic of a data set reorganization process of a data owner with vertically slicing data sets in accordance with an embodiment of the present description.

FIG. 8 illustrates a flow diagram of a data clustering process performed among various data owners according to embodiments of the present description.

FIG. 9 illustrates a flow diagram of a process for determining distance shares of sample distances performed between various data owners, according to an embodiment of the present description.

FIG. 10 shows a flow diagram of a re-clustering process performed among various data owners, according to an embodiment of the present description.

FIG. 11 illustrates a block diagram of a two-party data clustering apparatus based on data privacy protection according to an embodiment of the present description.

FIG. 12 shows a block diagram of a data clustering unit according to an embodiment of the present description.

FIG. 13 is a schematic diagram of an electronic device for implementing a two-party data clustering apparatus based on data privacy protection according to an embodiment of the present specification.

Detailed Description

The subject matter described herein will now be discussed with reference to example embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the subject matter described herein, and are not intended to limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as needed. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may also be combined in other examples.

As used herein, the term "include" and its variants are open-ended terms meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below. The definition of a term is consistent throughout the specification unless the context clearly dictates otherwise.

Secret sharing is a cryptographic technique that stores a secret in decomposed form. The secret is split into a plurality of shares in an appropriate manner, and each share is owned and managed by one of a plurality of parties (e.g., data owners). A single party cannot recover the complete secret; only a sufficient number of parties cooperating can recover it. Secret sharing aims to prevent the secret from being overly concentrated, thereby dispersing risk and tolerating intrusion.
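As an illustration of the idea, the following is a minimal sketch of additive secret sharing. It is not taken from the specification: the ring size `MODULUS` and the helper names `split_secret` and `recover_secret` are assumptions chosen for the example, and in this particular scheme all shares are needed to recover the secret.

```python
import secrets
from typing import List

MODULUS = 2 ** 64  # assumed ring size; the specification does not fix one


def split_secret(secret: int, n_parties: int) -> List[int]:
    """Split `secret` into n additive shares modulo MODULUS (hypothetical helper)."""
    # The first n-1 shares are uniformly random; the last one makes the sum match.
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % MODULUS
    return shares + [last]


def recover_secret(shares: List[int]) -> int:
    """Recover the secret by summing every share modulo MODULUS."""
    return sum(shares) % MODULUS


shares = split_secret(123456, n_parties=3)
recovered = recover_secret(shares)
```

Any proper subset of the shares is uniformly distributed and therefore carries no information about the secret on its own.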

FIG. 1 shows an example schematic of a data clustering process.

As shown in FIG. 1, at 110, K data samples are selected as the starting center points of K clusters. In this specification, the term "cluster" may also be referred to as "cluster category". The term "center point" may also be referred to as a "centroid" or "cluster center point". In one example, K data samples may be randomly selected from the data set to be clustered as the starting center points of the K clusters. In another example, K data points that are not members of the data set may be chosen arbitrarily as the starting center points of the K clusters.

Then, the operations 120 to 150 are performed in a loop until the cluster center point is no longer changed.

Specifically, in each loop, at 120, the dissimilarity between each remaining element (non-center-point element) in the data set to be clustered and each of the K current center points is calculated. In this specification, the dissimilarity between an element and a current center point is typically characterized by the sample distance between the element and the current center point.

At 130, each element is assigned to the cluster with which it has the lowest dissimilarity, according to the calculated dissimilarities between the element and the current center points.

At 140, the respective center points of the K clusters are recalculated based on the current clustering results. In one example, for each cluster, the center point of the cluster may be determined by calculating the arithmetic mean of the respective dimensions of all elements in the cluster.

At 150, it is determined whether the center point of each cluster determined has changed. If there is a change in the center point of at least one cluster, then returning to 120, the next loop process is performed for all elements in the dataset to be clustered. The currently determined center point of each cluster is used as the current center point of each cluster of the next cycle process. In this specification, for the center point of each cluster, whether the center point has changed is determined by calculating a sample distance between the determined center point and the current center point and comparing the calculated sample distance with a predetermined threshold. If the calculated sample distance is less than the predetermined threshold, the center point is considered unchanged. Otherwise, the center point is considered to be changed.

If the center points of all clusters are unchanged, the data clustering process ends and the data clustering result is output.
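The loop of FIG. 1 can be sketched in plain, non-private Python as follows. This is only an illustration under stated assumptions: the function name `k_means`, the default `threshold`, the fixed `seed`, and the handling of empty clusters are choices made for the example, and all elements (including those picked as starting centers) are assigned in each pass.

```python
import math
import random


def k_means(points, k, threshold=1e-9, seed=0):
    """Plain (non-private) k-means following steps 110-150 of FIG. 1."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # 110: pick K samples as starting centers
    while True:
        # 120-130: assign each point to the center with the smallest distance
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # 140: recompute each center as the per-dimension arithmetic mean
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                new_centers.append(centers[i])  # assumption: keep an empty cluster's center
        # 150: a center counts as changed if it moved by at least the threshold
        moved = any(math.dist(c, n) >= threshold for c, n in zip(centers, new_centers))
        centers = new_centers
        if not moved:
            return centers, clusters


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = k_means(points, k=2)
```

Note that step 120 needs the full coordinates of every center point, which is exactly the disclosure the two-party scheme is designed to avoid.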

In the data clustering process described above, complete information about the center points must be disclosed when the sample distances are determined. Because a center point may be a data sample in the data set, private data or private information held by a data owner may be leaked.

In view of the foregoing, embodiments of the present specification provide a two-party data clustering method based on data privacy protection. In the method, at each data owner, each data sample in that data owner's data set is split into two data shares. Each data owner shares one of the two data shares of each data sample with the other data owner. At each data owner, a reorganized data set of that data owner is obtained based on the data shares of the data samples retained by that data owner and the data shares of the data samples obtained from the other data owner. Subsequently, data clustering is performed between the data owners using the reorganized data sets of the data owners. With this method, the data set that each data owner uses for data clustering is reorganized from data shares originating from both data owners, so the complete information of any data sample of either data owner is not leaked during data clustering, thereby achieving two-party data clustering based on data privacy protection.

A method, an apparatus, and a system for clustering data based on data privacy protection according to embodiments of the present disclosure will be described below with reference to the accompanying drawings.

FIG. 2 illustrates a block diagram of a two-party data clustering system 200 based on data privacy protection in accordance with an embodiment of the present description.

As shown in FIG. 2, the two-party data clustering system 200 includes a first data owner 210 and a second data owner 220. The first data owner 210 has a first data set and the second data owner 220 has a second data set. The first data owner 210 and the second data owner 220 may communicate with each other via a network such as, but not limited to, the internet or a local area network.

The first data owner 210 and the second data owner 220 have two-party data clustering apparatuses 211 and 221, respectively. The first data owner 210 and the second data owner 220 perform data clustering on the first data set and the second data set via their respective two-party data clustering apparatuses. The two-party data clustering process between the first data owner 210 and the second data owner 220 will be described in detail below with reference to the figures.

FIG. 3 illustrates a flow diagram of a method for two-party data clustering based on data privacy protection in accordance with an embodiment of the present description. In the example of fig. 3, two data owners, Alice and Bob, are shown. The data-owner Alice has a first data set and the data-owner Bob has a second data set.

At 310, the data owner Alice splits each data sample in the first data set into two data shares, and the data owner Bob splits each data sample in the second data set into two data shares. Here, the splitting of data samples into data shares may be implemented in any suitable manner.

In this specification, a data sample may include a plurality of elements, each element value being the feature value of one of a plurality of feature dimensions, for example, the feature values corresponding to the feature dimensions f1, f2, f3, and f4. In one example, data shares may be derived by decomposing each element in a data sample into 2 element components and selecting one element component of each element to compose a new data sample. For example, assuming that the data sample a includes 5 elements {a1, a2, a3, a4, a5}, a1 may be decomposed into a1 = a11 + a12, a2 into a2 = a21 + a22, a3 into a3 = a31 + a32, a4 into a4 = a41 + a42, and a5 into a5 = a51 + a52. Then, 2 data shares {a11, a21, a31, a41, a51} and {a12, a22, a32, a42, a52} are constructed.
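The element-wise decomposition above can be sketched as follows. The helper name `split_sample` and the range of the random masks are assumptions made for the example; the specification leaves the splitting method open. Real-valued masks are used only for readability, whereas a practical system would more likely work with fixed-point values in a finite ring.

```python
import random


def split_sample(sample, rng=random):
    """Split each element a_i into a_i1 + a_i2 and return the two data shares."""
    # Assumed mask range; each element gets an independent random component.
    share_1 = [rng.uniform(-1e3, 1e3) for _ in sample]
    share_2 = [a - m for a, m in zip(sample, share_1)]
    return share_1, share_2


a = [1.5, 2.0, 3.25, 4.0, 5.75]
a_share_1, a_share_2 = split_sample(a)  # {a11..a51} and {a12..a52}
rebuilt = [x + y for x, y in zip(a_share_1, a_share_2)]
```

Adding the two shares element by element gives back the original sample up to floating-point rounding.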

At 320, the data owner Alice shares one of the two data shares of each sliced data sample to the data owner Bob, and the data owner Bob shares one of the two data shares of each sliced data sample to the data owner Alice.

At 330, the data owner Alice obtains her reorganized data set (third data set) based on the data shares of the data samples that Alice retains and the data shares of the data samples obtained from the data owner Bob. The data owner Bob obtains his reorganized data set (fourth data set) based on the data shares of the data samples that Bob retains and the data shares of the data samples obtained from the data owner Alice.

In one example, the first data set that the data owner, Alice, has and the second data set that the data owner, Bob, has may be horizontally sliced data sets.

FIG. 4 illustrates a schematic diagram of an example of a horizontally sliced dataset according to an embodiment of the present description. As shown in FIG. 4, in the case of horizontally sliced data sets, the data samples held by the data owners Alice and Bob have the same feature dimensions, for example, the 4 feature dimensions f1, f2, f3, and f4, but the feature value of at least one feature dimension differs, and the data IDs of the respective data samples are different.

As shown in FIG. 5, in the case of horizontally sliced data sets, when the data sets are reorganized, each data owner horizontally splices the data shares of the data samples retained by that data owner with the data shares of the data samples obtained from the other data owner, to obtain that data owner's reorganized data set. Here, horizontal splicing means adding a data share split off by one data owner, as a new data sample, to the data set of the other data owner. As shown in FIG. 5, after horizontal splicing, the third data set and the fourth data set contain the same data IDs, and each contains more data samples than the first data set or the second data set. Furthermore, for each data sample, each of the third and fourth data sets holds only a partial sample value. Summing the corresponding sample values in the third data set and the fourth data set yields the complete information of the data sample.
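A minimal sketch of this reorganization for horizontally sliced data is given below. The dictionary layout keyed by data ID, the helper `split_dataset`, the mask range, and the toy values are all assumptions for illustration rather than details from the specification.

```python
import random


def split_dataset(dataset, rng):
    """Split every sample into two additive shares, keyed by data ID."""
    kept, sent = {}, {}
    for data_id, sample in dataset.items():
        mask = [rng.uniform(-100.0, 100.0) for _ in sample]  # assumed mask range
        kept[data_id] = mask
        sent[data_id] = [v - m for v, m in zip(sample, mask)]
    return kept, sent


rng = random.Random(0)
first = {"X1": [1.0, 2.0, 3.0, 4.0], "X2": [5.0, 6.0, 7.0, 8.0]}  # Alice, f1..f4
second = {"X3": [9.0, 8.0, 7.0, 6.0]}                              # Bob, f1..f4

alice_kept, alice_sent = split_dataset(first, rng)
bob_kept, bob_sent = split_dataset(second, rng)

# Horizontal splicing: received shares are appended as additional samples (rows).
third = {**alice_kept, **bob_sent}   # Alice's reorganized data set
fourth = {**alice_sent, **bob_kept}  # Bob's reorganized data set
```

Both reorganized data sets now list X1, X2, and X3, and neither one alone contains a complete sample.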

In another example, the first data set that the data owner, Alice, has and the second data set that the data owner, Bob, has may be vertically sliced data sets.

FIG. 6 illustrates a schematic diagram of an example of vertically slicing a dataset according to an embodiment of the present description. As shown in FIG. 6, in the case of vertically sliced data sets, the data IDs of the data samples held by the data owners Alice and Bob correspond to each other, for example, both Alice and Bob have data samples X1, X2, and X3, but the data samples with the same data ID have different feature dimensions at the two data owners, for example, the data samples in the first data set have the 3 feature dimensions f1, f2, and f3, and the data samples in the second data set have the 3 feature dimensions f4, f5, and f6.

Further, it is noted that in the diagram of FIG. 6, for data samples with the same data ID, the feature dimensions in the first data set and the second data set are shown as entirely different. However, in other embodiments of the present description, the feature dimensions of the first data set and the second data set may be only partially different for data samples with the same data ID.

As shown in FIG. 7, in the case of vertically sliced data sets, when the data sets are reorganized, each data owner vertically splices the data shares of the data samples retained by that data owner with the data shares of the data samples obtained from the other data owner, to obtain that data owner's reorganized data set. Here, vertical splicing means adding a data share split off by one data owner, as new feature data, to the data sample with the corresponding data ID in the data set of the other data owner. As shown in FIG. 7, after vertical splicing, the number of data samples in the third data set and the fourth data set is the same as the number of data samples in the first data set and the second data set, and each data sample holds only a partial sample value. Adding the corresponding sample values in the third data set and the fourth data set yields the complete information of the data sample.
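For vertically sliced data, the reorganization can be sketched as below. As before, the dictionary layout, the helper `split`, the mask range, and the toy values are assumptions made for the example.

```python
import random

rng = random.Random(0)


def split(sample):
    """Return (kept share, sent share) of one sample; hypothetical helper."""
    mask = [rng.uniform(-100.0, 100.0) for _ in sample]  # assumed mask range
    return mask, [v - m for v, m in zip(sample, mask)]


first = {"X1": [1.0, 2.0, 3.0], "X2": [4.0, 5.0, 6.0]}   # Alice: f1..f3
second = {"X1": [7.0, 8.0, 9.0], "X2": [1.5, 2.5, 3.5]}  # Bob: f4..f6

third, fourth = {}, {}
for data_id in first:
    a_kept, a_sent = split(first[data_id])
    b_kept, b_sent = split(second[data_id])
    # Vertical splicing: received shares become extra feature columns of the same ID.
    third[data_id] = a_kept + b_sent   # Alice's reorganized sample, shares of f1..f6
    fourth[data_id] = a_sent + b_kept  # Bob's reorganized sample, shares of f1..f6
```

The number of samples is unchanged, while each reorganized sample now spans all six feature dimensions in share form.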

Returning to FIG. 3, after the data set reorganization of each data owner is completed as described above, data clustering is performed between the data owners Alice and Bob using the reorganized data sets of each data owner at 340.

As shown in FIG. 8, at 810, K initial cluster category center points are determined, and the determined K initial cluster category center points are split into shares. For example, each initial cluster category center point $C_i$ is split into center point shares $\langle C_i \rangle_1$ and $\langle C_i \rangle_2$; the data owner Alice obtains the center point share $\langle C_i \rangle_1$, and the data owner Bob obtains the center point share $\langle C_i \rangle_2$.

Then, the operations 820 to 850 are executed in a loop until the cluster category center point is not changed any more.

Specifically, at 820, between the data owners Alice and Bob, the reorganized data sets of the respective data owners are used to determine the sample distances between each data sample in the data sets of the data owners and each current cluster category center point Ci.

In one example, the sample distances between the data samples in the first and second data sets held by the data owners Alice and Bob and each current cluster category center point may be determined between the data owners by performing secure multi-party computation using the reorganized data sets (the third and fourth data sets) of the data owners, each data owner holding a distance share of each sample distance.

FIG. 9 illustrates a flow diagram of a process for determining distance shares of sample distances performed between various data owners, according to an embodiment of the present description. In the example of FIG. 9, only the determination of the distance share of a single data sample X1 with respect to the sample distance of the cluster class center point C1 is shown.

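As shown in FIG. 9, at 910, the data owner Alice uses the data share $\langle X_1 \rangle_1$ of data sample X1 and the center point share $\langle C_1 \rangle_1$ of cluster category center point C1 that it holds to locally calculate the square of the difference between the center point share and the data share, i.e., the value of $(\langle C_1 \rangle_1 - \langle X_1 \rangle_1)^2$. The data owner Bob uses the data share $\langle X_1 \rangle_2$ of data sample X1 and the center point share $\langle C_1 \rangle_2$ of cluster category center point C1 that it holds to locally calculate the value of $(\langle C_1 \rangle_2 - \langle X_1 \rangle_2)^2$. Here, data sample X1 equals the sum of the data shares $\langle X_1 \rangle_1$ and $\langle X_1 \rangle_2$.

At 920, the data owners Alice and Bob use secret sharing multiplication to compute the product of the difference $\langle C_1 \rangle_1 - \langle X_1 \rangle_1$ and the difference $\langle C_1 \rangle_2 - \langle X_1 \rangle_2$, and then take 2 times the product, i.e., $2(\langle C_1 \rangle_1 - \langle X_1 \rangle_1)(\langle C_1 \rangle_2 - \langle X_1 \rangle_2)$.

At 930, the obtained value $2(\langle C_1 \rangle_1 - \langle X_1 \rangle_1)(\langle C_1 \rangle_2 - \langle X_1 \rangle_2)$ is split into shares: the data owner Alice obtains one share, written here as $\langle P \rangle_1$, and the data owner Bob obtains the other share, written here as $\langle P \rangle_2$.

At 940, the data owner Alice sums the locally calculated $(\langle C_1 \rangle_1 - \langle X_1 \rangle_1)^2$ and the obtained share $\langle P \rangle_1$ to obtain its distance share of the sample distance between the data sample X1 and the cluster category center point C1, written here as $\langle D \rangle_1 = (\langle C_1 \rangle_1 - \langle X_1 \rangle_1)^2 + \langle P \rangle_1$. The data owner Bob sums the locally calculated $(\langle C_1 \rangle_2 - \langle X_1 \rangle_2)^2$ and the obtained share $\langle P \rangle_2$ to obtain its distance share $\langle D \rangle_2 = (\langle C_1 \rangle_2 - \langle X_1 \rangle_2)^2 + \langle P \rangle_2$. This yields the sample distance between the data sample X1 and the cluster category center point C1 in the form of distance shares, since $\langle D \rangle_1 + \langle D \rangle_2 = (C_1 - X_1)^2$.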

In the example shown in fig. 9, only the determination of the distance share of the sample distance between the data sample X1 and the cluster class center point C1 is described. According to the same method, the sample distance between each data sample in the data sets owned by the data owners Alice and Bob and the center point of each cluster type can be determined.
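The steps 910 to 940 can be sketched in Python for a single scalar feature. This is only an illustrative sketch under stated assumptions: the prime modulus `P`, the helper names, and the use of a Beaver triple from a simulated trusted dealer for the secret sharing multiplication are not taken from the specification, which does not fix a particular multiplication protocol. Both parties are simulated in one process.

```python
import secrets

P = 2**61 - 1  # assumed prime modulus for additive sharing (not from the source)

def share(x):
    """Split integer x into two additive shares modulo P."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def reconstruct(s1, s2):
    """Recombine two shares and decode the result as a signed integer."""
    v = (s1 + s2) % P
    return v - P if v > P // 2 else v

def beaver_multiply(a, b):
    """Return additive shares of a*b, where Alice holds a and Bob holds b.

    Hypothetical helper: a simulated trusted dealer hands out shares of
    random u, v and of w = u*v (a Beaver triple).
    """
    u, v = secrets.randbelow(P), secrets.randbelow(P)
    (u1, u2), (v1, v2), (w1, w2) = share(u), share(v), share(u * v % P)
    # Only the masked values e = a - u and f = b - v are opened.
    e = (a - u1 - u2) % P
    f = (b - v1 - v2) % P
    z1 = (e * f + e * v1 + f * u1 + w1) % P  # Alice's share of a*b
    z2 = (e * v2 + f * u2 + w2) % P          # Bob's share of a*b
    return z1, z2

def distance_shares(x1, x2, c1, c2):
    """Distance shares of (C1 - X1)^2 from the shares of X1 and C1."""
    d1 = (c1 - x1) % P                 # 910: Alice's local difference
    d2 = (c2 - x2) % P                 # 910: Bob's local difference
    z1, z2 = beaver_multiply(d1, d2)   # 920/930: shares of the product
    dist1 = (d1 * d1 + 2 * z1) % P     # 940: Alice's distance share
    dist2 = (d2 * d2 + 2 * z2) % P     # 940: Bob's distance share
    return dist1, dist2

x1, x2 = share(7)    # data sample X1 = 7
c1, c2 = share(10)   # center point C1 = 10
dist1, dist2 = distance_shares(x1, x2, c1, c2)
print(reconstruct(dist1, dist2))
```

Doubling each party's product share is equivalent to sharing twice the product. For a sample with several features, the per-feature distance shares would be summed locally by each party.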

Returning to fig. 8, after the sample distances between the data samples in the data sets owned by the data owners Alice and Bob and the cluster type center points are determined as above, at 830, data clustering is performed again on the data samples according to the determined sample distances between the data samples and the current cluster type center points.

In one example, for each data sample in the data set owned by each data owner, the data sample may be data clustered again according to a comparison protocol based on secret sharing using the distance shares each data owner has for each sample distance of the data sample.

As shown in FIG. 10, at 1010, the sample distances of the current data sample are compared in size according to a secret sharing based comparison protocol, using the distance shares that the data owners hold for those sample distances. The comparison of sample distances from their distance shares under such a protocol may be implemented using any suitable method in the art.

At 1020, according to the sample distance comparison result, the data sample is clustered into the cluster category whose current cluster category center point has the smallest sample distance. Optionally, in one example, the sample distance comparison result may be a category vector; e.g., for the data sample, the comparison result is a K-dimensional category vector L that marks the category to which the data sample belongs. The category vectors are stored at the respective data owners in a secret sharing manner. For example, for the category vector LX1 of the data sample X1, the data owner Alice holds the category vector share <LX1>1, and the data owner Bob holds the category vector share <LX1>2. Here, the category vector shares <LX1>1 and <LX1>2 are both K-dimensional vectors.

At 1030, it is determined whether the current data sample is the last data sample in the data sets of the two data owners. If it is the last data sample, the process ends.

If it is not the last data sample, then at 1040, the next data sample is selected, and the process returns to 1010 for the next iteration.
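The following Python sketch illustrates the input and output of steps 1010 and 1020 for one data sample. Since the specification leaves the comparison protocol open, the function `assign_category` below is a hypothetical stand-in that reconstructs the distances inside a simulated ideal functionality; it is not a secure comparison protocol and only shows that the result is a secret-shared K-dimensional category vector. The modulus `P` and all names are assumptions.

```python
import secrets

P = 2**61 - 1  # assumed prime modulus for additive sharing

def share(x):
    """Split integer x into two additive shares modulo P."""
    r = secrets.randbelow(P)
    return r, (x - r) % P

def reconstruct(s1, s2):
    """Recombine two shares and decode the result as a signed integer."""
    v = (s1 + s2) % P
    return v - P if v > P // 2 else v

def assign_category(dist_shares_alice, dist_shares_bob):
    """Stand-in for the comparison protocol (NOT secure).

    Takes each party's K distance shares for one sample and returns each
    party's share of the one-hot category vector.
    """
    dists = [reconstruct(a, b) for a, b in zip(dist_shares_alice, dist_shares_bob)]
    k_min = min(range(len(dists)), key=dists.__getitem__)  # smallest distance
    one_hot = [1 if k == k_min else 0 for k in range(len(dists))]
    pairs = [share(v) for v in one_hot]
    return [p[0] for p in pairs], [p[1] for p in pairs]

# Distances of one sample to K = 3 center points, held as shares.
pairs = [share(d) for d in (9, 4, 25)]
alice_d = [p[0] for p in pairs]
bob_d = [p[1] for p in pairs]
l_alice, l_bob = assign_category(alice_d, bob_d)
print([reconstruct(a, b) for a, b in zip(l_alice, l_bob)])
```

In an actual deployment the body of `assign_category` would be replaced by a secret sharing based comparison protocol, so that neither party sees the distances or the resulting category.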

Returning to FIG. 8, at 840, each current cluster category center point is updated based on the re-clustering results for the data sets (first and second data sets) of the data owners. In one example, each current cluster category center point may be updated using secure multi-party computation based on the category vector shares of each data sample that each data owner holds.

The following takes the cluster category center point C1 as an example to illustrate how the data owners Alice and Bob use secure multi-party computation to update the current cluster category center point.

The data owners Alice and Bob cooperatively compute the mean of the data samples classified into the cluster indicated by the cluster category center point C1, denoted as <C1>1. In this process, the category vector L exists in secret-shared form; for example, for the category vector LX1 of the data sample X1, the data owner Alice holds the category vector share <LX1>1, and the data owner Bob holds the category vector share <LX1>2. The matrix X of all data samples has dimensions n × d, where n is the number of data samples and d is the number of features; the data owner Alice holds the data share matrix <X>1, and the data owner Bob holds the data share matrix <X>2. The category matrix LX of all data samples has dimensions n × K, where K is the number of clusters; the data owner Alice holds the category vector share <LX>1, and the data owner Bob holds the category vector share <LX>2.

Then, a new cluster category center point matrix can be calculated according to the formula (LX^T X) / (LX^T I). Here, I is an n-dimensional all-ones vector, LX^T is the transpose of LX, and the division is applied row by row, i.e., each row of LX^T X is divided by the corresponding entry of LX^T I. Since LX and X both exist in secret-shared form, both the numerator LX^T X and the denominator LX^T I can be computed using secret sharing matrix multiplication. In addition, the division can be carried out approximately, using secret sharing or garbled circuits. According to the obtained calculation result, the data owners Alice and Bob each hold a K × d matrix, namely their respective share of the new center point matrix.
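As a cleartext illustration of the update formula only, the following Python sketch computes the new center point matrix from a one-hot category matrix LX and a sample matrix X. In the described method both matrices would be secret-shared and the products and division would be carried out under secure computation; the function name and the assumption that every cluster has at least one member are not from the specification.

```python
def update_centers(LX, X):
    """Cleartext (LX^T X) / (LX^T I): the mean of the samples in each cluster.

    LX is n x K (one-hot rows), X is n x d. Assumes no cluster is empty.
    """
    n, K, d = len(LX), len(LX[0]), len(X[0])
    # Numerator LX^T X: per-cluster sum of samples, a K x d matrix.
    numer = [[sum(LX[i][k] * X[i][j] for i in range(n)) for j in range(d)]
             for k in range(K)]
    # Denominator LX^T I: number of samples in each cluster, length K.
    denom = [sum(LX[i][k] for i in range(n)) for k in range(K)]
    return [[numer[k][j] / denom[k] for j in range(d)] for k in range(K)]

X = [[1, 2], [3, 4], [10, 10]]     # n = 3 samples, d = 2 features
LX = [[1, 0], [1, 0], [0, 1]]      # first two samples in cluster 0
print(update_centers(LX, X))       # [[2.0, 3.0], [10.0, 10.0]]
```

The denominator is linear in LX, so each party could compute its share of LX^T I locally from its own share of LX; the numerator multiplies two shared matrices and therefore needs an interactive multiplication.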

At 850, it is determined whether the updated center point of each cluster category has changed. If at least one cluster category center point has changed, the process returns to 820 and the next loop iteration is performed for all elements in the first and second data sets, using the currently updated center point of each cluster category as the current center point of that cluster category in the next iteration. In the present specification, for each cluster category center point, whether the center point has changed is determined by calculating the sample distance between the updated center point and the current center point and comparing the calculated sample distance with a predetermined threshold. If the calculated sample distance is less than the predetermined threshold, the center point is considered unchanged. Otherwise, the center point is considered changed.

If the central points of all the clustering categories are not changed, the data clustering process is finished, and a data clustering result is output.
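The termination test at 850 can be sketched in Python as follows. The sketch works on cleartext center points and uses the squared Euclidean distance as the sample distance, which is an assumption chosen to match the squared differences of FIG. 9; the function names and the threshold value are illustrative only.

```python
def center_changed(old_center, new_center, threshold):
    """A center point counts as changed if its distance to the previous
    center point is not less than the threshold."""
    dist = sum((o - n) ** 2 for o, n in zip(old_center, new_center))
    return dist >= threshold

def any_center_changed(old_centers, new_centers, threshold):
    """The loop continues while at least one center point has changed."""
    return any(center_changed(o, n, threshold)
               for o, n in zip(old_centers, new_centers))

old = [[0.0, 0.0], [5.0, 5.0]]
new = [[0.1, 0.0], [5.0, 5.0]]
print(any_center_changed(old, new, threshold=0.05))  # False: clustering ends
```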

The two-party data clustering method based on data privacy protection according to the embodiment of the present specification is described above with reference to fig. 1 to 10.

With the above method, at each data owner, each data sample in each data set is divided into two data shares. Each data owner shares one of the two data shares of each data sample that is cut out to the other data owner. At each data owner, a reconstituted data set of the data owner is obtained based on the data shares of each data sample retained by the data owner and the data shares of each data sample obtained from another data owner, respectively. Subsequently, data clustering is performed between the data owners using the regrouped data sets of the data owners. According to the data clustering mode, the data set used for data clustering is a data set recombined by using data shares from two data owners owned by each data owner, so that complete information of any sample data of the two data owners cannot be revealed during data clustering, and a two-party data clustering process based on data privacy protection is realized.

Further, with the above-described two-party data clustering method, by performing multi-party security calculation using the regrouped data sets of the two data owners to determine the sample distance between each data sample in the data sets of the two data owners and the cluster category center point, it is possible to further prevent the privacy data of the data owners from being revealed when the sample distance determination process is performed.

In addition, with the above-described two-party data clustering method, by comparing the sizes of the respective sample distances using a comparison protocol based on secret sharing, it is possible to further prevent leakage of private data or private information at the time of the data clustering process again.

FIG. 11 illustrates a block diagram of a two-party data clustering apparatus 1100 based on data privacy protection in accordance with an embodiment of the present description. As shown in FIG. 11, the two-party data clustering apparatus 1100 includes a data slicing unit 1110, a share sharing unit 1120, a data set reorganization unit 1130, and a data clustering unit 1140. The two-party data clustering apparatus 1100 shown in FIG. 11 is applied to either of the two data owners.

The data slicing unit 1110 is configured to slice each data sample in the data set held by the data owner into two data shares. The operation of the data slicing unit 1110 may refer to the operation of 310 described above with reference to FIG. 3.

The share sharing unit 1120 is configured to share one of the two data shares of the sliced respective data sample to the other data owner. The operation of the share sharing unit 1120 may refer to the operation of 320 described above with reference to FIG. 3.

The data set reorganizing unit 1130 is configured to obtain the reorganized data set of the data owner based on the data shares of each data sample retained by the data owner and the data shares of each data sample obtained from another data owner. The operation of the data set reorganization unit 1130 may refer to the operation of 330 described above with reference to FIG. 3.

Optionally, in an example, the data set owned by each data owner is a horizontally sliced data set, and the data set reorganizing unit 1130 is configured to transversely concatenate the data shares of each data sample retained by the data owner and the data shares of each data sample acquired from another data owner, so as to obtain the reorganized data set of the data owner.

Optionally, in another example, the data set owned by each data owner is a vertically sliced data set, and the data set restructuring unit 1130 is configured to longitudinally concatenate the data shares of each data sample retained by the data owner and the data shares of each data sample acquired from another data owner to obtain the restructured data set of the data owner.
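The slicing, sharing, and reorganization performed by units 1110 to 1130 can be sketched in Python as follows. The sketch assumes additive sharing modulo an assumed prime `P`, and it assumes that for horizontally sliced data (different samples, same features) the regrouped set is formed by stacking rows, while for vertically sliced data (same samples, different features) it is formed by joining the columns of each row. These readings and all function names are assumptions for illustration, not definitions from the specification.

```python
import secrets

P = 2**61 - 1  # assumed prime modulus for additive sharing

def split_dataset(data):
    """Slice every entry of every sample into two additive shares.

    Returns (kept, sent): the share matrix the owner retains and the
    share matrix it sends to the other owner.
    """
    kept, sent = [], []
    for row in data:
        r = [secrets.randbelow(P) for _ in row]
        kept.append(r)
        sent.append([(v - ri) % P for v, ri in zip(row, r)])
    return kept, sent

def regroup_horizontal(alice_part, bob_part):
    """Horizontal slicing: stack the rows of both parts."""
    return alice_part + bob_part

def regroup_vertical(alice_part, bob_part):
    """Vertical slicing: join the feature columns of matching rows."""
    return [ra + rb for ra, rb in zip(alice_part, bob_part)]

def reconstruct_matrix(m1, m2):
    """Add two share matrices entry by entry (for checking only)."""
    return [[(a + b) % P for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

alice_data = [[1, 2], [3, 4]]   # Alice's samples
bob_data = [[5, 6]]             # Bob's samples (same features)
a_kept, a_sent = split_dataset(alice_data)
b_kept, b_sent = split_dataset(bob_data)
alice_set = regroup_horizontal(a_kept, b_sent)  # Alice's regrouped data set
bob_set = regroup_horizontal(a_sent, b_kept)    # Bob's regrouped data set
print(reconstruct_matrix(alice_set, bob_set))   # [[1, 2], [3, 4], [5, 6]]
```

Both owners must place Alice's part and Bob's part in the same order, so that the two regrouped data sets are entry-wise shares of the same combined data set. Neither regrouped data set alone reveals any sample.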

The data clustering unit 1140 is configured to perform data clustering between the data owners using the regrouped data sets of the data owners, wherein the regrouped data set of the other data owner is derived based on the data shares of the data samples retained by the other data owner and the data shares of the data samples obtained from the data owner. The operation of the data clustering unit 1140 may refer to the operation of 340 described above with reference to FIG. 3.

Fig. 12 shows a block diagram of a data clustering unit 1200 according to an embodiment of the present specification. As shown in fig. 12, the data clustering unit 1200 includes a sample distance determination module 1210, a data clustering module 1220, and a center point updating module 1230.

When data clustering is performed, the sample distance determining module 1210, the data clustering module 1220 and the center point updating module 1230 execute operations in a loop until the cluster category center point is not changed any more, and when the cluster category center point is changed, the updated cluster category center point is used as the current cluster category center point of the next loop process.

The sample distance determination module 1210 is configured to determine, between the respective data owners, a sample distance between each data sample in the data set owned by each data owner and each current cluster category center point using the reorganized data set of each data owner.

The data clustering module 1220 is configured to perform data clustering again on each data sample according to the determined sample distance between each data sample and each current clustering category center point.

The center point updating module 1230 is configured to update the current cluster category center points according to the re-clustering result.

Optionally, in one example, the sample distance determination module 1210 is configured to perform a multi-party security calculation between the respective data owners using the regrouped data sets of the respective data owners to determine sample distances between respective data samples in the data sets owned by the respective data owners and respective current cluster category center points, the respective data owners having respective distance shares of the respective sample distances.

Optionally, in an example, for each data sample in the data set owned by each data owner, the data clustering module 1220 is configured to perform data clustering again on the data sample according to a comparison protocol based on secret sharing, using distance shares that each data owner has for each sample distance of the data sample.

Optionally, in an example, for each data sample in the data set owned by the respective data owner, the data clustering module 1220 is configured to compare the size of the respective sample distance according to a secret sharing-based comparison protocol using the distance share of the respective data owner for the respective sample distance of the data sample; and according to the comparison result of the sample distance, clustering the data sample to the cluster class to which the current cluster class central point with the minimum sample distance belongs. Further optionally, in one example, the sample distance magnitude comparison may be a category vector that is maintained at the respective data owners in a secret sharing fashion.

Optionally, in one example, the center point updating module 1230 is configured to update each current cluster category center point using secure multi-party computation according to the category vector shares of each data sample that each data owner holds.

The two-party data clustering method, the two-party data clustering apparatus, and the two-party data clustering system according to the embodiments of the present specification have been described above with reference to FIGS. 1 to 12. The above two-party data clustering apparatus can be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 13 is a hardware block diagram of an electronic device 1300 for implementing a two-party data clustering apparatus according to an embodiment of the present specification. As shown in FIG. 13, the electronic device 1300 may include at least one processor 1310, storage (e.g., non-volatile storage) 1320, memory 1330, and a communication interface 1340, and the at least one processor 1310, storage 1320, memory 1330, and communication interface 1340 are connected together via a bus 1360. The at least one processor 1310 executes at least one computer-readable instruction (i.e., the elements described above as being implemented in software) stored or encoded in memory.

In one embodiment, computer-executable instructions are stored in the memory that, when executed, cause the at least one processor 1310 to: each data sample in the data set is divided into two data shares; sharing one of the two data shares of each of the sliced data samples to the other data owner; obtaining a restructured data set of the data owner based on the data shares of the data samples reserved by the data owner and the data shares of the data samples acquired from another data owner; and performing data clustering between the data owners by using the restructuring data sets of the data owners, wherein the restructuring data set of the other data owner is obtained based on the data shares of the data samples reserved by the other data owner and the data shares of the data samples acquired from the data owner.

It should be appreciated that the computer-executable instructions stored in the memory, when executed, cause the at least one processor 1310 to perform the various operations and functions described above in connection with fig. 1-12 in the various embodiments of the present description.

According to one embodiment, a program product, such as a machine-readable medium (e.g., a non-transitory machine-readable medium), is provided. A machine-readable medium may have instructions (i.e., elements described above as being implemented in software) that, when executed by a machine, cause the machine to perform various operations and functions described above in connection with fig. 1-12 in the various embodiments of the present specification. Specifically, a system or apparatus may be provided which is provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and causes a computer or processor of the system or apparatus to read out and execute instructions stored in the readable storage medium.

In this case, the program code itself read from the readable medium can realize the functions of any of the above-described embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.

Examples of the readable storage medium include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or from the cloud via a communications network.

It will be understood by those skilled in the art that various changes and modifications may be made in the above-disclosed embodiments without departing from the spirit of the invention. Accordingly, the scope of the invention should be determined from the following claims.

It should be noted that not all steps and units in the above flows and system structure diagrams are necessary, and some steps or units may be omitted according to actual needs. The execution order of the steps is not fixed, and can be determined as required. The apparatus structures described in the above embodiments may be physical structures or logical structures, that is, some units may be implemented by the same physical entity, or some units may be implemented by a plurality of physical entities, or some units may be implemented by some components in a plurality of independent devices.

In the above embodiments, the hardware units or modules may be implemented mechanically or electrically. For example, a hardware unit, module or processor may comprise permanently dedicated circuitry or logic (such as a dedicated processor, FPGA or ASIC) to perform the corresponding operations. The hardware units or processors may also include programmable logic or circuitry (e.g., a general purpose processor or other programmable processor) that may be temporarily configured by software to perform the corresponding operations. The specific implementation (mechanical, or dedicated permanent, or temporarily set) may be determined based on cost and time considerations.

The detailed description set forth above in connection with the appended drawings describes exemplary embodiments but does not represent all embodiments that may be practiced or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A two-party data clustering method based on data privacy protection, the two parties including two data owners, each data owner having a data set, the two-party data clustering method applied to the data owners, the two-party data clustering method comprising:

each data sample in the data set is divided into two data shares;

sharing one of the two data shares of each of the cut data samples to another data owner, and obtaining one of the two data shares obtained by the another data owner by cutting each of the data samples in the data set owned by the another data owner;

obtaining a restructured data set of the data owner based on the data shares of the data samples reserved by the data owner and the data shares of the data samples acquired from the other data owner; and

data clustering, with another data owner, using the regrouped data sets of both data owners,

wherein the regrouped data set of the other data owner is obtained based on the data shares of the data samples retained by the other data owner and the data shares of the data samples acquired from the data owner.

2. The two-party data clustering method of claim 1, wherein the data sets that the two data-owning parties have are horizontally sliced data sets,

obtaining the regrouped data set of the data owner based on the data shares of the data samples retained by the data owner and the data shares of the data samples obtained from another data owner comprises:

and transversely splicing the data shares of the data samples reserved by the data owner and the data shares of the data samples acquired from the other data owner to obtain a recombined data set of the data owner.

3. The two-party data clustering method of claim 1, wherein the data sets that the two data-owning parties have are vertically sliced data sets,

deriving the reassembled data set for the data owner based on the data shares for each data sample retained by the data owner and the data shares for each data sample obtained from the other data owner comprises:

and longitudinally splicing the data shares of the data samples reserved by the data owner and the data shares of the data samples acquired from the other data owner to obtain a recombined data set of the data owner.

4. The two-party data clustering method of claim 1, wherein using the regrouped data sets of the two data owners for data clustering with the other data owner comprises:

the following processes are executed in a loop until the cluster category center point is not changed any more:

determining, with another data owner, a sample distance between each data sample in the data set owned by each data owner and each current cluster category center point using the regrouped data set of each data owner;

according to the sample distance between each determined data sample and each current clustering category central point, performing data clustering again on each data sample;

updating the central point of each current clustering category according to the data clustering result again,

when the cluster category center changes, the updated cluster category center point is used as the current cluster category center point of the next cycle process.

5. The two-party data clustering method of claim 4, wherein the sample distance between each data sample in the data set owned by each data owner and each current cluster category center point is determined between the two data owners by performing a secure multi-party computation using the regrouped data set of each data owner, each data owner having a distance share of each sample distance.

6. The two-party data clustering method of claim 5, wherein re-clustering the data samples according to the determined sample distance between each data sample and each current cluster category center point comprises:

for each data sample in the data set owned by each data owner, the data sample is clustered again according to a comparison protocol based on secret sharing using the distance shares for each sample distance of the data sample that each data owner has.

7. The two-party data clustering method of claim 6, wherein for each data sample in the data set owned by the respective data owner, re-clustering the data sample according to a secret sharing based comparison protocol using the distance shares, respectively owned by the respective data owners, for the respective sample distances of the data sample comprises:

for each data sample in the data set owned by the respective data owner,

comparing the sizes of the sample distances according to a comparison protocol based on secret sharing by using the distance shares respectively possessed by the data owners for the sample distances of the data samples; and

and according to the comparison result of the sample distance, clustering the data sample to the cluster class to which the current cluster class central point with the minimum sample distance belongs.

8. The two-party data clustering method of claim 7, wherein the sample distance comparison result is a category vector, and the category vector is stored at each data owner in a secret sharing manner.

9. The two-party data clustering method of claim 8, wherein updating each current cluster category center point according to the re-clustering result comprises:

updating each current cluster category center point using secure multi-party computation based on the category vector shares of each data sample that each data owner has.

10. A two-party data clustering apparatus based on data privacy protection, the two parties including two data owners, each data owner having a data set, the two-party data clustering apparatus being applied to the data owners, the two-party data clustering apparatus comprising:

the data segmentation unit is used for segmenting each data sample in the data set into two data shares;

the share sharing unit shares one of the two data shares of each data sample to another data owner, and acquires one of the two data shares obtained by the other data owner by segmenting each data sample in the data set from the other data owner;

the data set reorganization unit is used for obtaining the reorganized data set of the data owner based on the data shares of the data samples reserved by the data owner and the data shares of the data samples acquired from the other data owner; and

and the data clustering unit is used for carrying out data clustering together with the other data owner by using the reorganized data set of each data owner, wherein the reorganized data set of the other data owner is reorganized based on the data shares of each data sample reserved by the other data owner and the data shares of each data sample acquired from the data owner.

11. The two-party data clustering apparatus of claim 10, wherein the data sets that the two data owners have are horizontally sliced data sets,

and the data set reorganizing unit transversely splices the data shares of the data samples reserved by the data owner and the data shares of the data samples acquired from the other data owner to obtain the reorganized data set of the data owner.

12. The two-party data clustering apparatus of claim 10, wherein the data sets that the two data owners have are vertically sliced data sets,

and the data set reorganizing unit longitudinally splices the data shares of the data samples reserved by the data owner and the data shares of the data samples acquired from the other data owner to obtain the reorganized data set of the data owner.

13. The two-party data clustering apparatus of claim 10, wherein the data clustering unit comprises:

the sample distance determining module is used for determining the sample distance between each data sample in the data set owned by each data owner and each current clustering class central point by using the recombined data set of each data owner together with another data owner;

the data clustering module is used for clustering data of each data sample again according to the sample distance between each determined data sample and each current clustering class central point;

a central point updating module for updating the central point of each current cluster category according to the secondary data clustering result,

the sample distance determining module, the data clustering module and the central point updating module are executed in a circulating mode until the clustering category central point is not changed any more, and when the clustering category central point is changed, the updated clustering category central point is used as the current clustering category central point in the next circulating process.

14. The two-party data clustering apparatus of claim 13, wherein the sample distance determination module performs a multi-party security calculation between the respective data owners using the regrouped data sets of the respective data owners to determine sample distances between respective data samples in the data sets owned by the respective data owners and respective current cluster class center points, the respective data owners having respective distance shares of the respective sample distances.

15. The two-party data clustering apparatus of claim 14, wherein for each data sample in the data set owned by the respective data owner, the data clustering module re-clusters the data sample according to a comparison protocol based on secret sharing using the respective distance share for the respective sample distance of the data sample that the respective data owner has.

16. The two-party data clustering apparatus of claim 15, wherein, for each data sample in the data set owned by the respective data owner, the data clustering module compares the magnitude of the respective sample distance according to a secret sharing based comparison protocol using the distance share for the respective sample distance of the data sample that the respective data owner has; and according to the comparison result of the sample distance, clustering the data sample to the cluster class to which the current cluster class central point with the minimum sample distance belongs.

17. The two-party data clustering apparatus of claim 16, wherein the sample distance magnitude comparison is a category vector that is stored at each data-owning party in a secret sharing fashion.

18. The two-party data clustering apparatus of claim 17, wherein the center point updating module updates each current cluster category center point using secure multi-party computation based on the category vector shares of each data sample that each data owner has.

19. A two-party data clustering system based on data privacy protection, the two-party data clustering system comprising:

a first data owner comprising the two-party data clustering apparatus of any one of claims 10 to 18; and

a second data owner comprising a two-party data clustering arrangement as claimed in any one of claims 10 to 18.

20. An electronic device, comprising:

at least one processor, and

a memory coupled with the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform the two-party data clustering method of any one of claims 1 to 9.

21. A machine-readable storage medium storing executable instructions that, when executed, cause the machine to perform the two-party data clustering method of any one of claims 1 to 9.