CN113255841B


Info

Publication number: CN113255841B
Application number: CN202110747026.2A
Authority: CN
Inventors: 庄瑞格
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2021-07-02
Filing date: 2021-07-02
Publication date: 2021-11-16
Anticipated expiration: 2041-07-02
Also published as: CN113255841A; WO2023273081A1

Abstract

The application discloses a clustering method, a clustering device and a computer-readable storage medium. The method comprises: acquiring a plurality of first image archives, wherein each first image archive comprises a plurality of images to be clustered, and each image to be clustered has an attribute combination; calculating the similarity between the images to be clustered within each first image archive, recording it as an intra-class similarity, and constructing an intra-class similarity distribution based on the intra-class similarities; calculating the similarity between the images to be clustered that have the same attribute combination across all the first image archives, recording it as an inter-class similarity, and constructing an inter-class similarity distribution based on the inter-class similarities; merging, based on the intra-class similarity distribution and the inter-class similarity distribution, the images to be clustered that correspond to attribute combinations meeting a preset merging condition, to obtain at least one second image archive; and clustering all the second image archives to obtain a clustering result. In this way, the accuracy of image clustering can be improved.

Description

Clustering method, clustering device and computer readable storage medium

Technical Field

The present application relates to the field of image processing technologies, and in particular, to a clustering method, a clustering device, and a computer-readable storage medium.

Background

Currently, intelligent processing technology can determine whether two images belong to the same clustering target (e.g., a person, a vehicle or an animal); for example, face recognition technology can judge whether two face images belong to the same person, and clustering technology can determine which faces belong to the same person. However, the main problem of existing image clustering technology is that, when faced with a large number of pictures with different scenes and different attributes, it inevitably happens that one cluster archive contains several clustering targets, or that one clustering target is spread over several cluster archives; that is, the clustering is inaccurate.

Disclosure of Invention

The application provides a clustering method, a clustering device and a computer readable storage medium, which can improve the accuracy of image clustering.

In order to solve the technical problem, the technical solution adopted by the application is as follows: a clustering method is provided, the method comprising: acquiring a plurality of first image archives, wherein each first image archive comprises a plurality of images to be clustered, and each image to be clustered has an attribute combination; calculating the similarity between the images to be clustered within each first image archive, recording it as an intra-class similarity, and constructing an intra-class similarity distribution based on the intra-class similarities; calculating the similarity between the images to be clustered that have the same attribute combination across all the first image archives, recording it as an inter-class similarity, and constructing an inter-class similarity distribution based on the inter-class similarities; merging, based on the intra-class similarity distribution and the inter-class similarity distribution, the images to be clustered that correspond to attribute combinations meeting a preset merging condition, to obtain at least one second image archive; and clustering all the second image archives to obtain a clustering result.

In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a clustering device comprising a memory and a processor connected to each other, wherein the memory is used for storing a computer program, and the computer program is used for implementing the clustering method in the above technical solution when being executed by the processor.

In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a computer-readable storage medium for storing a computer program for implementing the clustering method in the above technical solution when the computer program is executed by a processor.

Through the above scheme, the beneficial effects of the application are as follows. First, a plurality of first image archives are acquired, in which each image to be clustered has an attribute combination. The images to be clustered in the first image archives are processed to construct the intra-class similarity distributions and inter-class similarity distributions of the different attribute combinations, and the images to be clustered corresponding to attribute combinations that meet a preset merging condition are merged, based on these distributions, to generate at least one second image archive. The second image archives are then clustered to obtain a clustering result, so that offline image clustering is realized. Because differences between the attribute combinations of the images are taken into account, the differences in similarity distribution caused by different attribute combinations can be smoothed out, the poor recall or accuracy caused by different attribute combinations is effectively mitigated, and the clustering accuracy is improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort. Wherein:

FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a clustering method provided herein;

FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a clustering method provided herein;

FIG. 3 is a schematic flow chart of step 210 in the embodiment shown in FIG. 2;

FIG. 4 is a schematic structural diagram of an embodiment of a clustering device provided in the present application;

FIG. 5 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

At present, a large number of faces can be clustered through face clustering technology, and the face photos belonging to the same person are grouped into one class, i.e., "one person, one archive", which is the ideal situation. In practice, however, the clustering process often produces "one person, multiple archives" (one person has several archives) or "one archive, multiple people" (one archive contains face photos of several different people).

The main reason for the above problems is that the similarity distributions of the pictures within the archives of different people differ greatly. For example, in actual data, the error rate of archives for categories such as the elderly, children, or women wearing masks is higher than that of archives for adults; for such data, it should be made harder to group images into one archive, which reduces false alarms. Although some schemes propose a dynamic threshold to improve accuracy and recall, they do not consider face attribute information or how the face similarity distribution differs between attributes. In actual clustering, under the same threshold, the false alarm rate for children and women wearing masks is much higher than for adult men. Typically, a threshold of 92 points separates the faces of adult men well, but high similarity readily occurs between different children or between different women wearing masks, so the threshold for them may need to be 95 points or even higher. However, if the threshold is set that high, a large number of adult male pictures that could have been recalled are lost, which in turn causes a serious "one person, multiple archives" problem.

Another reason is that, owing to factors such as angle, face quality or pixel size, the face pictures of two different people occasionally have an excessively high similarity, so that the face pictures of two people end up in one archive. Under the influence of such low-quality pictures, more and more face pictures resembling them are aggregated into the archive, and what was originally a single noise picture (i.e., a face image of a person who does not belong to the archive) grows into multiple noise pictures. For convenience of description, a single noise picture in an archive is referred to as single-point noise; if an archive contains multiple noise pictures, they are referred to as blob noise.

The concept of an attribute combination used in the present application is described below. An attribute combination is a combination of the attributes of an image to be clustered. For example, when the images to be clustered are face images, the attribute combination includes race, age, gender, and whether a mask or glasses are worn. If the images are divided by the age attribute of the face into elderly, middle-aged, young and children, by the gender attribute into men and women, and by whether a mask is worn, 16 attribute combinations are obtained; if the images are further subdivided according to other face attributes, more attribute combinations are obtained.
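The count of 16 in the example above can be checked with a minimal sketch. The attribute value names below are illustrative placeholders, not identifiers taken from the application.

```python
from itertools import product

# Hypothetical attribute values, chosen only to mirror the example in the text.
AGES = ("elderly", "middle-aged", "young", "child")
GENDERS = ("male", "female")
MASK_STATES = ("mask", "no mask")

# Every attribute combination is one (age, gender, mask) tuple.
attribute_combinations = list(product(AGES, GENDERS, MASK_STATES))

print(len(attribute_combinations))  # 4 * 2 * 2 = 16
```

Adding a further attribute, such as whether glasses are worn, would multiply the number of combinations by the number of values of that attribute.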

In order to solve the problem of similarity distribution differences among different attribute combinations, the intra-class similarity distributions and inter-class similarity distributions of the different attribute combinations are constructed on the basis of labeled first image archives; the similarity distributions (both intra-class and inter-class) of different attribute combinations are then merged, and clustering is carried out afterwards. Merging the similarity distributions of different attribute combinations effectively addresses the problem of several clustering targets existing in one archive and, to a certain extent, relieves the problem of one clustering target having several archives.

Referring to FIG. 1, FIG. 1 is a schematic flow chart of an embodiment of a clustering method provided in the present application, the method including:

Step 11: Acquire a plurality of first image archives.

First, a plurality of first image archives are acquired. Each first image archive comprises a plurality of images to be clustered, each image to be clustered has an attribute combination, and the attribute combinations of different types of images to be clustered may differ. For example, if the images to be clustered are face images, the attribute combination includes race, age, gender, and whether a mask or glasses are worn; if the images to be clustered are license plate images, the attribute combination may be the shape, the color or the number of the license plate. Specifically, the first image archives may be created by manual labeling: a program developer identifies each image to be clustered and places the images belonging to the same class in the same group, thereby obtaining a plurality of first image archives. It will be appreciated that the first image archives may also be obtained in other ways, for example by obtaining them directly from an image database or by using the classification results of other clustering models as the first image archives.

Furthermore, each first image archive also has an attribute combination: the number of images to be clustered corresponding to each attribute combination in the first image archive is counted, and the attribute combination with the largest count is taken as the attribute combination of that archive. For example, when the images to be clustered are face images, for a first image archive labeled as the same person, the attribute combination of each face image in the archive can be counted, and the attribute combination that occurs most often is regarded as the attribute combination of the archive.
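The majority rule described above can be sketched as follows. The function name and the tuple encoding of an attribute combination are assumptions made for illustration; the text does not say how ties are broken, so this sketch simply keeps the first combination encountered.

```python
from collections import Counter

def archive_attribute_combination(image_attributes):
    """Return the attribute combination that occurs most often in one archive.

    image_attributes holds one attribute combination per image (hypothetical encoding).
    """
    if not image_attributes:
        raise ValueError("archive contains no images")
    counts = Counter(image_attributes)
    # most_common(1) yields the (combination, count) pair with the highest count.
    combination, _ = counts.most_common(1)[0]
    return combination

archive = [
    ("adult", "male", "no mask"),
    ("adult", "male", "no mask"),
    ("adult", "male", "mask"),
]
dominant = archive_attribute_combination(archive)
print(dominant)
```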

Step 12: Calculate the similarity between the images to be clustered within each first image archive, record it as the intra-class similarity, and construct an intra-class similarity distribution based on the intra-class similarities.

After the first image archives are acquired, the images to be clustered in each first image archive can be processed. For example, a feature extraction method is used to extract the features of the images to be clustered, the similarity between the features of every two images to be clustered is calculated and recorded as an intra-class (i.e., within-archive) similarity, and the intra-class similarity distribution is then constructed from all the intra-class similarities.
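As a rough sketch of this step, the pairwise similarities within one archive can be computed from feature vectors. The text does not name a similarity measure; cosine similarity is assumed here purely for illustration, and the feature vectors are toy values rather than the output of a real feature extractor.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity of two non-zero feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def intra_class_similarities(features):
    """Similarity of every unordered pair of images inside one archive."""
    return [cosine_similarity(u, v) for u, v in combinations(features, 2)]

# Toy features for three images of one archive.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = intra_class_similarities(features)
print(len(sims))  # 3 images give 3 pairs
```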

Step 13: Calculate the similarity between the images to be clustered that have the same attribute combination across all the first image archives, record it as the inter-class similarity, and construct an inter-class similarity distribution based on the inter-class similarities.

After the attribute combination of each first image archive is obtained, the first image archives are merged according to attribute combination; that is, at least two first image archives with the same attribute combination are merged, and the merged archive is recorded as an attribute image archive. The images to be clustered in the attribute image archive are then processed. For example, their features are extracted, the similarity between the features of every two images to be clustered in the attribute image archive is calculated and recorded as an inter-class (i.e., between-archive) similarity, and the inter-class similarity distribution is then constructed from all the inter-class similarities.
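The grouping step can be sketched as below. The text speaks of any two images in the attribute image archive while also glossing "inter-class" as "between archives"; this sketch assumes the second reading and only pairs images drawn from different archives. The dot-product similarity and the input layout are likewise assumptions.

```python
from collections import defaultdict
from itertools import combinations

def dot_similarity(u, v):
    """Dot product; equals cosine similarity when features are unit-normalized."""
    return sum(a * b for a, b in zip(u, v))

def inter_class_similarities(archives, similarity=dot_similarity):
    """Map each attribute combination to its cross-archive similarities.

    archives: list of (attribute_combination, [feature, ...]) pairs (assumed layout).
    """
    grouped = defaultdict(list)
    for attribute_combination, features in archives:
        grouped[attribute_combination].append(features)

    result = {}
    for attribute_combination, group in grouped.items():
        similarities = []
        # Pair every image of one archive with every image of another archive.
        for archive_a, archive_b in combinations(group, 2):
            similarities.extend(similarity(u, v) for u in archive_a for v in archive_b)
        result[attribute_combination] = similarities
    return result

archives = [
    ("adult male", [[1.0, 0.0], [0.6, 0.8]]),
    ("adult male", [[0.0, 1.0]]),
    ("child", [[1.0, 0.0]]),
]
inter = inter_class_similarities(archives)
print(inter["adult male"])
```

An attribute combination that has only one archive yields an empty list, since there is no second archive to compare against.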

Through the above processing, each attribute combination ends up with its own similarity distributions. Suppose there are $k$ ($k \ge 1$) attribute combinations $D_1, D_2, D_3, \ldots, D_k$; then there are $k$ intra-class similarity distributions $F_1, F_2, F_3, \ldots, F_k$, which represent the intra-class similarity distributions under the different attribute combinations. Similarly, the similarities between different archives under the same attribute combination can be calculated to obtain $k$ inter-class similarity distributions $F'_1, F'_2, F'_3, \ldots, F'_k$.

In a specific embodiment, to construct the intra-class similarity distribution and the inter-class similarity distribution, a preset similarity range, which may be [0, 1] or [0%, 100%], is divided into a preset number of intervals. The number of intra-class similarities and the number of inter-class similarities falling in each interval are counted and recorded as the first number and the second number, respectively. The first number is divided by the total number of intra-class similarities to obtain the probability value corresponding to the intra-class similarities in that interval, and the intra-class similarity distribution is constructed from all the intra-class similarities and their probability values. Likewise, the second number is divided by the total number of inter-class similarities to obtain the probability value corresponding to the inter-class similarities in that interval, and the inter-class similarity distribution is constructed from all the inter-class similarities and their probability values.

For example, suppose a first image archive contains M1 images, and calculating the similarity between every two images to be clustered in it yields M2 intra-class similarities. With the preset similarity range [0, 1] divided into 10 intervals (0-0.1, 0.1-0.2, ..., 0.9-1), the probability that the M2 intra-class similarities fall in each interval is calculated; for example, the probability of falling in the interval (0.9, 1) might be 0.7 and the probability of falling in the interval (0.8, 0.9) might be 0.2. This gives the correspondence between the intra-class similarities and the probabilities.
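The interval counting described above amounts to a normalized histogram, which might be sketched as follows. The treatment of interval boundaries is an assumption (each interval is taken as closed on the left, and a similarity of exactly 1 is placed in the last interval), since the text does not specify it.

```python
def similarity_distribution(similarities, num_bins=10):
    """Fraction of similarities in each of num_bins equal intervals of [0, 1]."""
    if not similarities:
        raise ValueError("no similarities to count")
    counts = [0] * num_bins
    for s in similarities:
        if not 0.0 <= s <= 1.0:
            raise ValueError("similarity outside [0, 1]")
        # A similarity of exactly 1.0 is placed in the last interval (assumption).
        index = min(int(s * num_bins), num_bins - 1)
        counts[index] += 1
    total = len(similarities)
    return [count / total for count in counts]

dist = similarity_distribution([0.95, 0.91, 0.85, 0.97])
print(dist[8], dist[9])  # 0.25 0.75
```

The same function can be applied to inter-class similarities to obtain the inter-class distribution.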

Step 14: Based on the intra-class similarity distribution and the inter-class similarity distribution, merge the images to be clustered corresponding to the attribute combinations that meet the preset merging condition to obtain at least one second image archive.

The preset merging condition is a predefined condition that determines whether two attribute combinations can be merged. After the intra-class similarity distribution and the inter-class similarity distribution are obtained, the images to be clustered corresponding to attribute combinations with similar or identical similarity distributions can be merged together, based on these distributions, to obtain at least one second image archive; the number of second image archives is therefore smaller than the number of first image archives.

It can be understood that if the similarity distributions of no two attribute combinations satisfy the preset merging condition, no merging is performed and step 15 is executed directly.

Step 15: Cluster all the second image archives to obtain a clustering result.

The at least one second image archive is clustered with a clustering method to obtain the corresponding clustering result, which completes the clustering. Specifically, a single clustering method may be used for one round of clustering, or several clustering methods may be used in succession; for example, one clustering method (such as hierarchical clustering or density clustering) is used first, the resulting clusters serve as the initial clusters of the second round, and another clustering method (such as K-means clustering) is then used for the second round.

It can be understood that, when the image to be clustered is a face image, the clustering method provided by the embodiment can be applied to the technical field of face recognition.

The embodiment provides a clustering method based on similarity distribution of attribute combinations of images to be clustered, which includes the steps of firstly constructing similarity distribution of different attribute combinations (namely constructing a mapping relation between the attribute combinations and the similarity distribution), then combining the images to be clustered corresponding to the attribute combinations with similar/same similarity distribution, generating at least one second image archive, then carrying out clustering processing on all the second image archives to obtain clustering results, and realizing offline image clustering; due to the fact that attribute combination of the images is considered, difference of similarity distribution caused by different attribute combinations can be smoothed, the problem that recall rate or accuracy rate is poor due to different attribute combinations is effectively relieved, and improvement of clustering accuracy rate and recall rate is facilitated.

Referring to FIG. 2, FIG. 2 is a schematic flow chart of another embodiment of a clustering method provided in the present application, the method including:

Step 201: Acquire a plurality of first image archives.

Step 202: Calculate the similarity between the images to be clustered within each first image archive, record it as the intra-class similarity, and construct an intra-class similarity distribution based on the intra-class similarities.

Step 203: Calculate the similarity between the images to be clustered that have the same attribute combination across all the first image archives, record it as the inter-class similarity, and construct an inter-class similarity distribution based on the inter-class similarities.

Steps 201 to 203 are the same as steps 11 to 13 in the above embodiment and are not described again.

Step 204: Select two attribute combinations from all the attribute combinations as a first attribute combination and a second attribute combination, and acquire the intra-class similarity distribution and the inter-class similarity distribution of each of them.

Each attribute combination has corresponding intra-class similarity distribution and inter-class similarity distribution, different attribute combinations may have similar similarity distribution, and in order to determine whether the similarity distributions corresponding to the attribute combinations are similar, the similarity of the similarity distributions of any two attribute combinations may be compared; further, the first attribute combination and the second attribute combination may be selected in a randomly selected order, or an order identifier may be created for each attribute combination, and the first attribute combination and the second attribute combination may be selected in sequence according to the order identifier.

Step 205: Calculate the similarity between the intra-class similarity distribution of the first attribute combination and that of the second attribute combination to obtain a first distribution similarity, and calculate the similarity between the inter-class similarity distribution of the first attribute combination and that of the second attribute combination to obtain a second distribution similarity.

After the similarity distributions of the first attribute combination and of the second attribute combination are obtained, the intra-class similarity distributions and the inter-class similarity distributions can be compared separately; that is, the similarity between the intra-class similarity distributions of the two attribute combinations is calculated, and the similarity between their inter-class similarity distributions is calculated.

Step 206: Judge whether the first distribution similarity and the second distribution similarity meet the preset merging condition.

After the first distribution similarity and the second distribution similarity are calculated, whether they satisfy the preset merging condition can be determined using a measure of the similarity between distributions, such as the KL (Kullback-Leibler) divergence, the f-divergence or the Wasserstein distance.

In a specific embodiment, the KL divergence is used to measure the difference between two distributions; that is, the first distribution similarity includes a first intra-class divergence and a second intra-class divergence, and the second distribution similarity includes a first inter-class divergence and a second inter-class divergence. First, the KL divergence of the intra-class similarity distribution of the first attribute combination with respect to that of the second attribute combination is calculated to obtain the first intra-class divergence, and the KL divergence of the intra-class similarity distribution of the second attribute combination with respect to that of the first attribute combination is calculated to obtain the second intra-class divergence. The KL divergence of the inter-class similarity distribution of the first attribute combination with respect to that of the second attribute combination is calculated to obtain the first inter-class divergence, and the KL divergence of the inter-class similarity distribution of the second attribute combination with respect to that of the first attribute combination is calculated to obtain the second inter-class divergence. It is then judged whether the first intra-class divergence and the second intra-class divergence are each smaller than a first preset value, and whether the first inter-class divergence and the second inter-class divergence are each smaller than a second preset value; that is, whether the two similarity distributions are merged is determined by the following conditions, all of which must hold:

$$KL(F_i \,\|\, F_j) < \mathrm{Threshold}_{KL}, \qquad KL(F_j \,\|\, F_i) < \mathrm{Threshold}_{KL},$$

$$KL(F'_i \,\|\, F'_j) < \mathrm{Threshold}'_{KL}, \qquad KL(F'_j \,\|\, F'_i) < \mathrm{Threshold}'_{KL},$$

where $F_i$ and $F_j$ are the intra-class similarity distributions of the first attribute combination and the second attribute combination, and $F'_i$ and $F'_j$ are their inter-class similarity distributions; $1 \le i, j \le k$, $k$ is the total number of attribute combinations corresponding to all the first image archives, and $i \ne j$; $\mathrm{Threshold}_{KL}$ is the first preset value and $\mathrm{Threshold}'_{KL}$ is the second preset value, both of which are preset thresholds; $KL(F_i \,\|\, F_j)$ is the first intra-class divergence, $KL(F_j \,\|\, F_i)$ is the second intra-class divergence, $KL(F'_i \,\|\, F'_j)$ is the first inter-class divergence, and $KL(F'_j \,\|\, F'_i)$ is the second inter-class divergence.
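The four-way KL test described in this embodiment might look like the following sketch for binned distributions. The small smoothing constant `eps` is an assumption added here so that empty intervals do not make the divergence undefined; the function names and the threshold values in the usage lines are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two discrete distributions over the same intervals."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of intervals")
    # Additive smoothing (assumption) keeps empty intervals from producing log(0).
    ps = [x + eps for x in p]
    qs = [x + eps for x in q]
    total_p, total_q = sum(ps), sum(qs)
    return sum(
        (a / total_p) * math.log((a / total_p) / (b / total_q))
        for a, b in zip(ps, qs)
    )

def can_merge(intra_i, intra_j, inter_i, inter_j, intra_threshold, inter_threshold):
    """True when all four divergences fall below their preset values."""
    return (
        kl_divergence(intra_i, intra_j) < intra_threshold
        and kl_divergence(intra_j, intra_i) < intra_threshold
        and kl_divergence(inter_i, inter_j) < inter_threshold
        and kl_divergence(inter_j, inter_i) < inter_threshold
    )

same = can_merge([0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.9, 0.1], 0.1, 0.1)
different = can_merge([0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.1], 0.1, 0.1)
print(same, different)
```

Both directions are checked because the KL divergence is not symmetric: a small divergence in one direction does not imply a small divergence in the other.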

The above steps are repeated until all the attribute combinations have been traversed.

Step 207: If the preset merging condition is met, merge the images to be clustered corresponding to the first attribute combination with the images to be clustered corresponding to the second attribute combination to obtain a second image archive.

If the similarity distributions of the first attribute combination and the second attribute combination meet the preset merging condition, the intra-class similarity distributions $F_j$ and $F_i$ are merged into the same distribution, and the inter-class similarity distributions $F'_j$ and $F'_i$ are merged into the same distribution. After merging based on the KL divergence, the $k$ attribute combinations $D_1, D_2, D_3, \ldots, D_k$ correspond to $d$ intra-class similarity distributions $F_1, F_2, F_3, \ldots, F_d$ and $d$ inter-class similarity distributions $F'_1, F'_2, F'_3, \ldots, F'_d$, where $d$ is smaller than $k$; that is, several attribute combinations correspond to the same similarity distribution.
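One way to turn pairwise merge decisions into groups of attribute combinations is a union-find structure, sketched below. Treating merging as transitive (if A merges with B and B with C, all three share one group) is an assumption of this sketch rather than something the text states, and `mergeable` stands in for whatever pairwise test is used.

```python
def merge_attribute_combinations(attribute_combinations, mergeable):
    """Group attribute combinations whose distributions satisfy the merge test."""
    parent = {c: c for c in attribute_combinations}

    def find(c):
        # Follow parent links to the group representative, halving the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for index, first in enumerate(attribute_combinations):
        for second in attribute_combinations[index + 1:]:
            if mergeable(first, second):
                parent[find(second)] = find(first)

    groups = {}
    for c in attribute_combinations:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())

# Hypothetical outcome of the pairwise test for four attribute combinations.
mergeable_pairs = {("D1", "D2"), ("D2", "D3")}
groups = merge_attribute_combinations(
    ["D1", "D2", "D3", "D4"],
    lambda a, b: (a, b) in mergeable_pairs,
)
print(groups)  # [['D1', 'D2', 'D3'], ['D4']]
```

Here k = 4 attribute combinations collapse into d = 2 groups, each of which would share one intra-class and one inter-class distribution.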

After merging according to the similarity of the attribute combinations, a first clustering can be carried out and it is judged whether blob noise exists in each archive; a second clustering is then carried out to filter the blob noise; finally, abnormal noise detection is performed to filter the single-point noise, as shown in steps 208 to 212.

Step 208: Take the second image archive as the current archive to be clustered.

Step 209: Cluster the current archive to be clustered using a first preset clustering algorithm to obtain a plurality of third image archives.

A first preset clustering algorithm is used for the first clustering. It may be a density clustering algorithm, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the Incremental Grid Density-Based Clustering Algorithm (IGDCA), Ordering Points To Identify the Clustering Structure (OPTICS), or a clustering algorithm based on the Largest Set of Not-Covered Core Points (LSNCCP). This embodiment uses the DBSCAN density clustering algorithm to generate the first-level cluster archives.

Step 210: Determine whether a preset clustering termination condition is met based on the intra-class similarity distribution and the inter-class similarity distribution of the third image archive.

If the intra-class similarity distribution and the inter-class similarity distribution of the third image archive satisfy the preset clustering termination condition, clustering is finished and the single-point noise can be filtered, i.e., step 212 is executed.

In a specific embodiment, as shown in FIG. 3, the following steps are used to determine whether the preset clustering termination condition is satisfied:

Step 31: Acquire the attribute combination of the third image archive and record it as the current attribute combination.

For each third image archive, the number of images to be clustered corresponding to each attribute combination in the archive is counted. Suppose the archive contains $N$ images to be clustered and there are $k$ attribute combinations $D_1, D_2, D_3, \ldots, D_k$, with $N_1, N_2, N_3, \ldots, N_k$ images respectively. If $N_i$ is the maximum of $N_1, N_2, N_3, \ldots, N_k$, then $D_i$ is the attribute combination of this archive.

Step 32: Obtain the intra-class similarity distribution and the inter-class similarity distribution of the current attribute combination.

The intra-class similarity distribution $F_i$ and the inter-class similarity distribution $F'_i$ of the current attribute combination are looked up.

Step 33: Calculate the intra-class probability and the inter-class probability using the intra-class similarity distribution and the inter-class similarity distribution of the current attribute combination.

The intra-class similarity distribution is constructed based on the intra-class similarity of the attribute combination and the corresponding probability, and the probability values corresponding to all the intra-class similarities corresponding to the current attribute combination are multiplied to obtain the intra-class probability; the inter-class similarity distribution is constructed based on the inter-class similarity of the attribute combination and the corresponding probability, and the probability values corresponding to all the inter-class similarities corresponding to the current attribute combination are multiplied to obtain the inter-class probability.

Further, for any similarity $p$, its probabilities under the intra-class and inter-class similarity distributions can be denoted $F_i(p)$ and $F'_i(p)$ respectively. Since the archive contains $N$ images, there are $N(N-1)/2$ pairs of images to be clustered, that is, $N(N-1)/2$ similarities. If these similarities are denoted $p_1, p_2, p_3, \ldots, p_{N(N-1)/2}$, the probability of the similarities occurring together can be calculated from the probabilities of the individual similarities. The probability that all the images to be clustered in the archive are the same clustering target (i.e. the intra-class probability) is

$$P = \prod_{j=1}^{N(N-1)/2} F_i(p_j),$$

and the probability that all the images to be clustered in the archive are different clustering targets (i.e. the inter-class probability) is

$$P' = \prod_{j=1}^{N(N-1)/2} F'_i(p_j).$$

The smaller the intra-class probability $P$, the more likely it is that cluster noise exists in the archive; likewise, the larger the inter-class probability $P'$, the more likely it is that cluster noise exists in the archive.
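The product over all pairwise similarities can be sketched as follows. Two choices here are assumptions of this sketch rather than part of the described method: the products are accumulated as sums of logarithms to avoid floating-point underflow for large archives, and empty histogram bins are floored at a small `eps`. The function names, the toy similarity function, and the five-bin distributions are likewise hypothetical.

```python
import math
from itertools import combinations


def lookup(similarity, distribution):
    """Probability of the bin containing `similarity` (bins evenly split [0, 1])."""
    idx = max(0, min(int(similarity * len(distribution)), len(distribution) - 1))
    return distribution[idx]


def archive_log_probabilities(features, similarity, intra_dist, inter_dist, eps=1e-12):
    """Return (log P, log P') over all N(N-1)/2 image pairs of one archive."""
    log_p_intra = 0.0
    log_p_inter = 0.0
    for a, b in combinations(features, 2):
        s = similarity(a, b)
        # eps keeps log() finite when a bin has zero probability (assumption).
        log_p_intra += math.log(max(lookup(s, intra_dist), eps))
        log_p_inter += math.log(max(lookup(s, inter_dist), eps))
    return log_p_intra, log_p_inter


# Toy data: scalar "features" and a made-up similarity, for illustration only.
intra_dist = [0.0, 0.0, 0.1, 0.3, 0.6]
inter_dist = [0.6, 0.3, 0.1, 0.0, 0.0]
features = [0.10, 0.15, 0.12]
log_p, log_p_prime = archive_log_probabilities(
    features, lambda a, b: 1.0 - abs(a - b), intra_dist, inter_dist
)
```

Because the logarithm is monotonic, comparing `log_p` with the logarithm of a probability threshold is equivalent to comparing the product itself with that threshold.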

Step 34: Judge whether the intra-class probability and the inter-class probability meet a preset clustering termination condition.

Judge whether the intra-class probability is greater than or equal to a first preset probability threshold and whether the inter-class probability is less than or equal to a second preset probability threshold, both thresholds being set in advance. If the intra-class probability is smaller than the first preset probability threshold, or the inter-class probability is larger than the second preset probability threshold, it is determined that cluster noise exists in the current archive to be clustered, and the archive with cluster noise is clustered a second time.
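The termination test reduces to two threshold comparisons. The sketch below is illustrative only; the function name and the threshold values used in the example calls are hypothetical, since the text does not state concrete values.

```python
def has_cluster_noise(intra_prob, inter_prob, first_threshold, second_threshold):
    """Return True when the archive fails the termination condition.

    The condition is met only if the intra-class probability is at least
    `first_threshold` AND the inter-class probability is at most
    `second_threshold`; otherwise cluster noise is assumed to be present.
    """
    terminated = intra_prob >= first_threshold and inter_prob <= second_threshold
    return not terminated


# Hypothetical thresholds, chosen only to exercise both branches.
needs_second_clustering = has_cluster_noise(0.2, 0.01, 0.5, 0.1)
is_done = not has_cluster_noise(0.7, 0.01, 0.5, 0.1)
```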

In a specific embodiment, the second preset clustering algorithm is a K-means clustering algorithm. When cluster noise exists in the third image archive, the K-means clustering algorithm is used to cluster the third image archive. When no cluster noise exists in the third image archive, single-point noise may still be present, so it can be determined whether single-point noise exists in the third image archive; if it does, the single-point noise is filtered out. It is understood that the single-point noise filtering may also be performed directly, without a separate single-point noise detection operation, i.e., step 212 is performed.

Step 211: If the preset clustering termination condition is not met, cluster the third image archive using a second preset clustering algorithm to obtain a plurality of fourth image archives, and take the fourth image archives as the current archives to be clustered.

If the preset clustering termination condition is not met, the third image archive is likely to contain cluster noise. Because the archive is the result of one pass of density clustering, the amount of cluster noise is not large. In this case, a second preset clustering algorithm can be used to cluster the third image archive to obtain a plurality of fourth image archives; the fourth image archives are taken as the current archives to be clustered, the first preset clustering algorithm is applied to the current archives to be clustered, and the process then returns to step 209. Taking face images as an example, after one pass of density clustering an archive may exhibit the "one archive, multiple people" situation; considering the probability of false alarms, the number of different people in one archive generally does not exceed three.

In view of this, a K-means clustering algorithm is used for secondary filtering. Further, given the amount of cluster noise, the parameter K of K-means is set to 2 (that is, the number of fourth image archives is 2), so that two purer archives are formed after K-means clustering. Archive A of face images is taken as an example below.

Archive A forms archives A1 and A2 after K-means clustering. Let the probability that the images to be clustered in archive A are the same person be $P_A$, and the probability that they are different persons be $P'_A$.

Suppose the probabilities that the images to be clustered in archive A1 are the same person and different persons are $P_{A1}$ and $P'_{A1}$ respectively, and the corresponding probabilities for archive A2 are $P_{A2}$ and $P'_{A2}$. Clearly $P_A < P_{A1}$ and $P_A < P_{A2}$, since $P_A$ is a product of probabilities, each at most 1, over all image pairs in A, while $P_{A1}$ and $P_{A2}$ each use only a subset of those pairs. The archives are therefore guaranteed to be purer after K-means clustering, but there is no guarantee that the intra-class and inter-class probabilities of archives A1 and A2 satisfy the preset clustering termination condition.

If the intra-class probability and the inter-class probability of archive A1 satisfy the preset clustering termination condition, the process proceeds to step 212; otherwise it returns to step 209 for another iteration. Archive A2 is handled in the same way. This ensures that all archives are relatively clean, with little likelihood of cluster noise, when step 212 is entered.
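The iterative splitting with K = 2 can be sketched as a recursion that keeps dividing an archive until every piece passes the termination test. This is a simplified illustration under stated assumptions: it uses a plain standard-library K-means on feature vectors, it omits the density re-clustering that the described flow performs when returning to step 209, and `is_clean` is a hypothetical stand-in for the probability-based termination condition. The names `two_means`, `refine`, and `max_depth` are likewise invented for the example.

```python
import random


def two_means(points, iterations=20, seed=0):
    """Split points (lists of floats) into two groups with plain K-means, K = 2."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            d = [sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers]
            groups[0 if d[0] <= d[1] else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, nothing more to refine
        centers = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    return groups


def refine(archive, is_clean, max_depth=5):
    """Recursively split an archive until every piece passes `is_clean`."""
    if max_depth == 0 or len(archive) < 2 or is_clean(archive):
        return [archive]
    left, right = two_means(archive)
    if not left or not right:
        return [archive]
    return refine(left, is_clean, max_depth - 1) + refine(right, is_clean, max_depth - 1)


def is_clean(archive):
    # Hypothetical purity test: spread of the first coordinate below 1.0.
    return max(p[0] for p in archive) - min(p[0] for p in archive) < 1.0


# Toy archive mixing two well-separated "identities".
archive = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0]]
pieces = refine(archive, is_clean)
```

The `max_depth` guard is a safety limit added for the sketch; the text itself only states that iteration continues until the termination condition is met.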

Step 212: Filter out the single-point noise.

When entering the single point noise filtering step, all the files are high probability files, but there is still a possibility of single point noise. For a face image, single-point noise may be caused by a plurality of problems such as the angle of the face, face occlusion, image pixel or image quality. For single-point noise, the present embodiment filters the single-point noise by using the following two schemes:

1) Calculate the intra-class similarity distribution of the third image archive, recorded as the first intra-class similarity distribution; calculate the similarity between each image to be clustered in the third image archive and the other images in the archive, construct an intra-class similarity distribution from these similarities, and record it as the second intra-class similarity distribution; calculate the similarity between the first and second intra-class similarity distributions, recorded as the distribution similarity; take the image to be clustered corresponding to the maximum value among all the distribution similarities as single-point noise, and delete the single-point noise from the third image archive.

Since single-point noise is essentially an outlier within the archive, the similarities between the single-point noise and the other points in the archive follow a different distribution. Assume the intra-class similarity distribution of the archive is $F$ and the archive contains $N$ images to be clustered. Calculating the similarity between one image to be clustered and each of the other images yields $(N-1)$ similarities, which form an intra-class similarity distribution for that image. Repeating this for every image gives one intra-class similarity distribution per image, finally yielding the intra-class similarity distributions $F_1, F_2, F_3, \ldots, F_N$.

For each of these intra-class similarity distributions, its difference from the overall intra-class similarity distribution $F$ of the archive is calculated, for example by using the KL divergence to measure the difference between the two intra-class similarity distributions. The image to be clustered whose distribution differs most from $F$ is filtered out as single-point noise.
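The distribution-difference scheme can be sketched as below. Several details are assumptions of this sketch because the text leaves them open: the divergence is computed in the direction KL(per-image distribution || archive distribution), histograms use five equal bins over [0, 1], and a small `eps` smooths empty bins. All function names and the similarity matrix are illustrative.

```python
import math


def histogram(values, num_bins=5):
    """Normalized histogram of similarities in [0, 1]."""
    counts = [0] * num_bins
    for v in values:
        counts[max(0, min(int(v * num_bins), num_bins - 1))] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with additive smoothing so empty bins stay finite."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))


def find_single_point_noise(sim):
    """sim: symmetric N x N similarity matrix; return the most atypical index."""
    n = len(sim)
    # Overall distribution F from all N(N-1)/2 pairwise similarities.
    overall = histogram([sim[i][j] for i in range(n) for j in range(i + 1, n)])
    divergences = []
    for i in range(n):
        # Per-image distribution built from its N-1 similarities to the others.
        per_image = histogram([sim[i][j] for j in range(n) if j != i])
        divergences.append(kl_divergence(per_image, overall))
    return max(range(n), key=divergences.__getitem__)


# Made-up matrix: images 0-2 resemble each other, image 3 does not.
sim = [
    [1.00, 0.92, 0.95, 0.30],
    [0.92, 1.00, 0.93, 0.35],
    [0.95, 0.93, 1.00, 0.32],
    [0.30, 0.35, 0.32, 1.00],
]
noise_index = find_single_point_noise(sim)
```

In this toy matrix every similarity of image 3 falls in a low bin, so its distribution diverges most from the archive-wide one and it is the image selected for removal.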

2) Filter the third image archive using an abnormal value detection (also called outlier detection) algorithm, so as to filter out the single-point noise in the third image archive.

Outlier detection is the process of finding objects whose behavior differs greatly from what is expected; such objects are called outliers or anomalies. Outlier detection algorithms have concrete real-life applications, such as credit card fraud detection, industrial damage detection, and image detection. Single-point noise in an image cluster can be detected with an outlier detection algorithm, for example by filtering the single-point noise with the Isolation Forest algorithm or the Random Cut Forest (RCF) algorithm.

It will be appreciated that, for detected single-point noise, a separate archive may be created and the noise placed in it for subsequent use.

In this embodiment, the probabilities of the similarities are used to infer the intra-class probability and the inter-class probability of an archive, and thereby to judge how likely it is that noise exists in the archive. For cluster noise, DBSCAN density clustering is combined with K-means clustering: the high recall of DBSCAN and iterative K-means clustering are used together so that cluster noise in the archive is filtered out as far as possible. After this first filtering, the similarity distribution difference and an outlier detection algorithm are introduced to detect single-point noise, so that purer archives are obtained and the problem of multiple clustering targets existing in one archive is effectively addressed.

In summary, to address the differences in similarity distribution between different attribute combinations, the intra-class and inter-class similarity distributions of the different attribute combinations are constructed from labeled archives, and the similarity distributions of the different attribute combinations are then grouped by KL divergence. During clustering, a density clustering algorithm completes the first clustering; the overall attribute combination of each archive is then determined by counting the attribute combinations of all the images in the archive. From the similarity distributions corresponding to that attribute combination and the similarities within the archive, the probability that all the face images in the archive are the same person and the probability that they are not the same person can be obtained. K-means clustering is performed within any archive whose probability values do not meet the preset clustering termination condition, which effectively filters out cluster noise. Finally, taking each archive as a unit, single-point noise is filtered out using either the similarity distribution difference of the single-point noise or an anomaly detection algorithm. By combining the similarity distributions of different attribute combinations with two layers of noise filtering, the "one archive, multiple people" problem can be effectively addressed, and the "one person, multiple archives" problem is alleviated to some extent.

Referring to fig. 4, fig. 4 is a schematic structural diagram of an embodiment of a clustering apparatus provided in the present application. The clustering apparatus 40 includes a memory 41 and a processor 42 connected to each other; the memory 41 is used for storing a computer program, and the computer program, when executed by the processor 42, implements the clustering method in the foregoing embodiments.

Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application. The computer-readable storage medium 50 is used for storing a computer program 51, and the computer program 51, when executed by a processor, implements the clustering method in the foregoing embodiments.

The computer-readable storage medium 50 may be a server, a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules or units is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims

1. A clustering method, comprising:

acquiring a plurality of first image archives, wherein the first image archives comprise a plurality of images to be clustered marked as the same class, and each image to be clustered has an attribute combination;

calculating the similarity between the images to be clustered in the first image file, recording the similarity as an intra-class similarity, and constructing intra-class similarity distribution based on the intra-class similarity;

calculating the similarity between the images to be clustered with the same attribute combination in all the first image files, recording the similarity as an inter-class similarity, and constructing an inter-class similarity distribution based on the inter-class similarity;

merging the images to be clustered corresponding to the attribute combination meeting the preset merging condition based on the intra-class similarity distribution and the inter-class similarity distribution to obtain at least one second image file;

and clustering all the second image archives to obtain a clustering result.

2. The clustering method according to claim 1, wherein the step of clustering all the second image files to obtain a clustering result comprises:

taking the second image archive as a current archive to be clustered;

clustering the current archives to be clustered by adopting a first preset clustering algorithm to obtain a plurality of third image archives;

determining whether a preset clustering termination condition is satisfied based on the intra-class similarity distribution of the third image archive and the inter-class similarity distribution of the third image archive;

if not, clustering the third image archives by adopting a second preset clustering algorithm to obtain a plurality of fourth image archives, taking the fourth image archives as the current archives to be clustered, and returning to the step of clustering the current archives to be clustered by adopting the first preset clustering algorithm.

3. The clustering method according to claim 2, characterized in that the method comprises:

acquiring an attribute combination of the third image file and recording the attribute combination as a current attribute combination;

obtaining the intra-class similarity distribution and the inter-class similarity distribution of the current attribute combination;

calculating the intra-class probability and the inter-class probability by using the intra-class similarity distribution of the current attribute combination and the inter-class similarity distribution of the current attribute combination;

and judging whether the intra-class probability and the inter-class probability meet the preset clustering termination condition or not.

4. The clustering method according to claim 3, wherein the intra-class similarity distribution is constructed based on the intra-class similarity and the corresponding probability of the attribute combination, the inter-class similarity distribution is constructed based on the inter-class similarity and the corresponding probability of the attribute combination, and the step of calculating the intra-class probability and the inter-class probability by using the intra-class similarity distribution of the current attribute combination and the inter-class similarity distribution of the current attribute combination comprises:

multiplying the probability values corresponding to all the intra-class similarities corresponding to the current attribute combination to obtain the intra-class probability;

and multiplying the probability values corresponding to all the inter-class similarities corresponding to the current attribute combination to obtain the inter-class probability.

5. The clustering method according to claim 4, characterized in that the method further comprises:

dividing the preset similarity interval into intervals with preset number;

counting the number of the intra-class similarities and the number of the inter-class similarities falling in each interval, and recording them as a first number and a second number respectively;

dividing the first number by the total number of the intra-class similarities to obtain a probability value corresponding to the intra-class similarities, and constructing the intra-class similarity distribution based on all the intra-class similarities and the corresponding probability values;

and dividing the second number by the total number of the inter-class similarities to obtain a probability value corresponding to the inter-class similarity, and constructing the inter-class similarity distribution based on all the inter-class similarities and the corresponding probability values.

6. The clustering method according to claim 3, wherein the step of determining whether the intra-class probability and the inter-class probability satisfy the preset clustering termination condition comprises:

judging whether the intra-class probability is greater than or equal to a first preset probability threshold value or not and whether the inter-class probability is less than or equal to a second preset probability threshold value or not;

if not, determining that cluster noise exists in the current archive to be clustered.

7. The clustering method according to claim 6, wherein the first preset clustering algorithm is a density clustering algorithm, the second preset clustering algorithm is a K-means clustering algorithm, the method further comprising:

when cluster noise exists in the third image archive, clustering the third image archive by adopting the K-means clustering algorithm;

and when the cluster noise does not exist in the third image file, filtering the single-point noise.

8. The clustering method according to claim 7, wherein the step of filtering the single-point noise comprises:

calculating the intra-class similarity distribution of the third image file, and recording it as a first intra-class similarity distribution;

calculating the similarity between each image to be clustered in the third image file and other images in the third image file, calculating the intra-class similarity distribution based on the similarity, and recording the intra-class similarity distribution as a second intra-class similarity distribution;

calculating the similarity between the first and second intra-class similarity distributions, and recording the similarity as the distribution similarity;

and taking the image to be clustered corresponding to the maximum value in all the distribution similarity degrees as the single-point noise, and deleting the single-point noise from the third image archive.

9. The clustering method according to claim 7, wherein the step of filtering the single point noise further comprises:

and filtering the third image file by adopting an abnormal value detection algorithm so as to filter the single-point noise in the third image file.

10. The clustering method according to claim 1, wherein the step of merging the images to be clustered corresponding to the attribute combination satisfying a preset merging condition based on the intra-class similarity distribution and the inter-class similarity distribution to obtain at least one second image file comprises:

selecting two attribute combinations from all the attribute combinations as a first attribute combination and a second attribute combination;

acquiring the intra-class similarity and the inter-class similarity of the first attribute combination, and acquiring the intra-class similarity and the inter-class similarity of the second attribute combination;

calculating the similarity between the intra-class similarity of the first attribute combination and the intra-class similarity of the second attribute combination to obtain a first distribution similarity;

calculating the similarity between the inter-class similarity of the first attribute combination and the inter-class similarity of the second attribute combination to obtain a second distribution similarity;

judging whether the first distribution similarity and the second distribution similarity meet a preset merging condition or not;

if so, merging the images to be clustered corresponding to the first attribute combination with the images to be clustered corresponding to the second attribute combination to obtain a second image file;

and repeatedly executing the steps until all the attribute combinations are traversed.

11. The clustering method of claim 10, wherein the first distribution similarity comprises a first intra-class divergence and a second intra-class divergence, and the second distribution similarity comprises a first inter-class divergence and a second inter-class divergence, the method further comprising:

calculating KL divergence of the intra-class similarity of the first attribute combination and the intra-class similarity of the second attribute combination to obtain the first intra-class divergence;

calculating KL divergence of the intra-class similarity of the second attribute combination and the intra-class similarity of the first attribute combination to obtain the second intra-class divergence;

calculating KL divergence of the inter-class similarity of the first attribute combination and the inter-class similarity of the second attribute combination to obtain the first inter-class divergence;

and calculating KL divergence of the inter-class similarity of the second attribute combination and the inter-class similarity of the first attribute combination to obtain the second inter-class divergence.

12. The clustering method according to claim 11, wherein the step of determining whether the first distribution similarity and the second distribution similarity satisfy a preset merging condition comprises:

and judging whether the first intra-class divergence is smaller than a first preset value, whether the second intra-class divergence is smaller than the first preset value, whether the first inter-class divergence is smaller than a second preset value, and whether the second inter-class divergence is smaller than the second preset value.

13. The clustering method according to claim 1, characterized in that the method further comprises:

counting the number of the images to be clustered corresponding to different attribute combinations in the first image file;

and taking the attribute combination with the maximum number as the attribute combination of the first image file.

14. The clustering method according to claim 1,

the image to be clustered is a face image, and the attribute combination comprises race, age, sex, and whether a mask or glasses are worn.

15. A clustering device, comprising a memory and a processor connected to each other, wherein the memory is configured to store a computer program, which when executed by the processor is configured to implement the clustering method according to any one of claims 1 to 14.

16. A computer-readable storage medium for storing a computer program, wherein the computer program, when being executed by a processor, is adapted to carry out the clustering method according to any one of claims 1 to 14.