Summary of the invention
The present invention provides a deep image hashing method based on few training samples, capable of obtaining a relatively effective image hash model from only a few training samples.
To achieve the above technical effect, the technical scheme of the present invention is as follows:
A deep image hashing method based on few training samples, comprising the following steps:
S1: task definition and data partitioning;
S2: constructing a general triplet-based deep hash model;
S3: constructing a support memory based on the general deep hash model;
S4: learning feature representations of few-shot samples through a bidirectional long short-term memory sub-network and the support memory;
S5: training the deep image hash model under the few-sample setting, and carrying out retrieval tests on the few-sample test set.
Further, the detailed process of step S1 is:
S11: taking the cifar100 dataset as an example, a specific definition of few-shot hashing is given. Cifar100 is divided into 2 parts: the first part has 80 classes, each with 500 sufficient training pictures, and is denoted S (support set); the other part has 20 classes, each with only a small number of training samples (3, or 5, 10, ...), and is denoted L (learning set). The goal is to train a deep hash model that enables pictures belonging to these 20 classes to be retrieved relatively efficiently from the image database covering all 100 classes.
Further, the detailed process of step S2 is:
S21: for the task of deep image hashing, a feature learning sub-network, i.e. a deep convolutional network (CNN), is constructed first. The convolutional network is formed by stacking convolutional layers, activation layers and pooling layers, and has powerful feature representation ability;
S22: after the convolutional sub-network, each picture is converted into a semantic feature vector; a fully connected layer with q output neurons and a corresponding sigmoid activation layer are then appended after the feature vector. In this way, each image is converted into a q-dimensional real vector with values between 0 and 1, i.e. an approximate hash vector;
S23: after the hash vectors are obtained, they are constrained by the triplet ranking loss. The purpose of the triplet loss is that, through learning, the distance between the approximate hash vectors of similar pictures becomes far smaller than the distance between the hash vectors of dissimilar pictures;
S24: the general triplet-based deep hash network is trained to obtain the general deep hash model.
Further, the detailed process of step S3 is as follows:
S31: from the above task definition, the dataset has 2 parts: one part is S (support set); the other part, which is the one of interest and has few training samples, is L (learning set). Each class in S has sufficient training samples, corresponding to things already met or learned; each class in L has very few training samples, corresponding to newly seen things;
S32: feature extraction is performed on the samples of S with the trained triplet-based general deep hash model. Specifically: each sample I[i][j] (1≤i≤s, 1≤j≤n, where s is the number of classes of S and n is the number of samples per class) is sequentially input into the general deep hash model to obtain the semantic feature of each picture;
S33: all features are arranged into M[i][j], specifically: within a row, i is the same, indicating that the feature vectors of the row belong to the same class, and different columns index the j-th sample feature vector of that class; M serves as the support memory.
Further, the detailed process of step S4 is as follows:
S41: in each iteration, the support memory pops up one feature vector for each class in a specified order, denoted f_t, 1≤t≤s;
S42: the forward and backward directions of the bidirectional long short-term memory sub-network (BLSTM) are unrolled into s time steps;
S43: f_l is used as the non-time-varying (static) input x_static of the BLSTM, and f_t is used as the time-varying input x_t of the BLSTM;
S44: through the interaction between the BLSTM sub-network and the support memory, the final feature representation of the new few-shot sample is obtained;
S45: the new feature representation is constrained with the triplet ranking loss.
Further, the detailed process of step S5 is as follows:
S51: the whole network is trained with the stochastic gradient descent method;
S52: the test set of L is retrieved against the whole image database, and the test results are computed.
Compared with the prior art, the beneficial effects of the technical solution of the present invention are:
Existing hashing methods, both traditional and deep-learning-based, are designed under the premise of a large number of training samples, whereas in a real production environment the cost of obtaining a large number of labeled training samples is very high. Therefore, a method that can obtain a relatively effective image hash model under few training samples has great practical value.
Embodiment 1
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
1. Task definition and data partitioning
A deep neural network updates the whole network end-to-end whenever a new training sample arrives; if the new training samples are very few, overfitting will inevitably occur and performance becomes very poor. However, it has been observed that when humans see new things, they often associate them with things seen before. For example, when a child sees a tiger for the first time, he may search his memory and discover that this new thing closely resembles the cats he has often met; thus, having met the tiger just once, he may already remember what a tiger looks like.
Inspired by this, the idea is applied to the image hashing problem: if the training pictures of certain things are very few, e.g. only 3 or 5 per class, a "prior knowledge" (prior knowledge) or "support memory" (support memory) can be learned from the abundant samples of other existing things, and the new samples can then be learned from this prior knowledge. This is referred to as few-shot hashing.
Taking the cifar100 dataset as an example, a specific definition of few-shot hashing is given below. Cifar100 is divided into 2 parts: the first part has 80 classes, each with 500 sufficient training pictures, and is denoted S (support set); the other part has 20 classes, each with only a small number of training samples (3, or 5, 10, ...), and is denoted L (learning set). The goal is to train a deep hash model that enables pictures belonging to these 20 classes to be retrieved relatively efficiently from the image database covering all 100 classes.
2. Constructing the general triplet-based deep hash model
Deep hash models are widely used in the field of image retrieval, for example in "search by image" services and similar-product search on Taobao. Moreover, the deep hash model is the foundation of the few-shot hashing model, so the deep hash model is explained first. As shown in Figure 1, it is mainly divided into three parts: the feature learning sub-network, the hash code generation sub-network and the loss function.
A) feature learning sub-network
In the image domain, a deep convolutional network (CNN) is formed by stacking convolutional layers, activation layers and pooling layers and has powerful feature representation ability; common examples include AlexNet, GoogLeNet, VGG and ResNet. An image can be converted into a feature vector by the convolutional network; for example, the 1024-dimensional vector of the last pooling layer of GoogLeNet, or the 4096-dimensional vector of the last fully connected layer of VGG, can serve as the feature representation of the image, and such deep features are far better than traditional handcrafted features such as GIST and SIFT features. GoogLeNet, the network used here, serves as the example in the following.
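As an illustration only (the patent does not prescribe an implementation), the following is a minimal sketch, assuming PyTorch and torchvision, of extracting the 1024-dimensional GoogLeNet pooling feature described above; the helper name extract_feature is hypothetical:

import torch
import torchvision.models as models
import torchvision.transforms as T

# GoogLeNet backbone: its global average-pooling layer yields a
# 1024-dimensional feature vector per image.
backbone = models.googlenet(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()   # drop the classifier, keep the 1024-d feature
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_feature(img):
    # Map one PIL image to its 1024-d semantic feature vector.
    with torch.no_grad():
        return backbone(preprocess(img).unsqueeze(0)).squeeze(0)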
B) Hash code generation sub-network
After the CNN, each picture is converted into a 1024-dimensional feature vector, while the final goal is to obtain a 0/1 hash code of a specific length, e.g. a 12-bit hash code. The most intuitive and most common approach is therefore to append, after the 1024-dimensional feature vector, a fully connected layer with 12 output neurons followed by a sigmoid activation function. In this way, each image is converted into a 12-dimensional real vector with values between 0 and 1, referred to as the approximate hash code vector.
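Continuing the sketch under the same PyTorch assumption, the hash code generation sub-network amounts to one fully connected layer plus a sigmoid; the class name HashHead is hypothetical:

import torch.nn as nn

class HashHead(nn.Module):
    # Maps a 1024-d feature to a q-dimensional approximate hash vector in (0, 1).
    def __init__(self, feat_dim=1024, q=12):
        super().__init__()
        self.fc = nn.Linear(feat_dim, q)   # fully connected layer with q output neurons
        self.act = nn.Sigmoid()            # squashes each output into (0, 1)

    def forward(self, feat):
        return self.act(self.fc(feat))     # approximate hash code vector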
C) Triplet ranking loss
There are many kinds of loss functions for deep hash models; they can roughly be divided into two major classes, one pair-based and the other triplet-based. The triplet ranking loss is used here and is explained in detail below.
The input of the triplet ranking loss is triplets of images. Given an image dataset with samples I, let sim denote the similarity between 2 images; if sim(I, I+) > sim(I, I-), then (I, I+, I-) is called a triplet. For example, on a single-label image dataset: if image a and image b belong to the same class while a and c are dissimilar, then (a, b, c) is a triplet.
The purpose of the triplet ranking loss is to learn, through training, hash vectors such that the distance between the approximate hash vectors of I and I+ is far smaller than the distance between the hash vectors of I and I-. Its mathematical definition is as follows:

l_tri(v(I), v(I+), v(I-)) = max(0, m + ||v(I) - v(I+)|| - ||v(I) - v(I-)||)    (1)

s.t. v(I), v(I+), v(I-) ∈ [0, 1]^q
where v(I) denotes the approximate hash code vector and m denotes the distance margin parameter. From formula (1) it can be seen that when the distance between I and I- is less than the distance between I and I+ plus the margin, the loss value is greater than zero; the resulting loss widens the distance between I and I- and reduces the distance between I and I+. When the distance between I and I- is greater than the distance between I and I+ plus the margin, the loss is zero, indicating that this triplet has been learned successfully.
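A minimal sketch of formula (1), assuming PyTorch and the Euclidean distance between approximate hash vectors; the function name and the margin value are illustrative:

import torch
import torch.nn.functional as F

def triplet_ranking_loss(v_a, v_p, v_n, margin=0.5):
    # Formula (1): max(0, m + ||v(I) - v(I+)|| - ||v(I) - v(I-)||),
    # where v_a, v_p, v_n are the approximate hash vectors of I, I+ and I-.
    d_pos = F.pairwise_distance(v_a, v_p)   # ||v(I) - v(I+)||
    d_neg = F.pairwise_distance(v_a, v_n)   # ||v(I) - v(I-)||
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()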
After the deep hash model is trained, a user submits a picture, which is converted by the deep hash model into an approximate hash vector. Through quantization (each component of the approximate hash vector that is greater than or equal to 0.5 becomes 1, and 0 otherwise), it becomes a binary hash code. Hamming distances are computed between this hash code and the hash codes of all images in the database; after sorting all Hamming distances from small to large, the top-k results can be returned quickly and presented to the user.
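The quantization and Hamming-distance ranking described above can be sketched as follows, assuming the database codes are stored as a NumPy 0/1 matrix (names hypothetical):

import numpy as np

def quantize(v):
    # Threshold the approximate hash vector at 0.5 to obtain a binary hash code.
    return (np.asarray(v) >= 0.5).astype(np.uint8)

def retrieve_topk(query_code, db_codes, k=10):
    # XOR marks differing bits; summing them gives the Hamming distance
    # from the query code to every database code.
    dists = (db_codes ^ query_code).sum(axis=1)
    return np.argsort(dists, kind="stable")[:k]   # indices of the top-k results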
3. Constructing the support memory based on the general deep hash model
From the above task definition, the dataset has 2 parts: one part is S (support set), and the other part, which is of interest and has few training samples, is L (learning set). Each class in S has sufficient training samples, corresponding to the things a child has already met or learned; each class in L has very few training samples, corresponding to newly seen things. The "prior knowledge", or "support memory", is first constructed from S.
A triplet-based deep hash network is trained with all the data of S, as shown in Figure 1. Since the training samples of S are sufficient, the parameters of the deep hash network can be learned very well, yielding an effective hash model, denoted the support hashing model (SHM).
The "support memory" is then constructed with the SHM. Specifically: each sample I[i][j] (1≤i≤s, 1≤j≤n, where s is the number of classes of S, e.g. 80, and n is the number of samples per class, e.g. 500) is sequentially input into the SHM, as shown in Fig. 2, to obtain the 1024-dimensional feature of each picture (from the last pooling layer). All features are then arranged into M[i][j], specifically: within a row, i is the same, indicating that the feature vectors of the row belong to the same class, and different columns index the j-th sample feature vector of that class; M serves as the support memory.
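A sketch of this construction, reusing the hypothetical extract_feature helper from the earlier sketch (here assumed to wrap the trained SHM); the shapes follow the text, with s classes, n samples per class and 1024-dimensional features:

import torch

def build_support_memory(support_set, extract_feature, s=80, n=500, dim=1024):
    # Arrange SHM features into M[i][j]: row i = class, column j = j-th sample.
    M = torch.zeros(s, n, dim)
    for i in range(s):           # one row per class of S
        for j in range(n):       # one column per sample of that class
            M[i, j] = extract_feature(support_set[i][j])
    return M                     # the support memory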
4. Learning feature representations of few-shot samples through the bidirectional long short-term memory sub-network and the support memory
This part is the core of few-shot hashing; it describes how new few-shot samples are learned from the support memory.
The overall network structure of few-shot hashing is given first. The main difference is that a support memory and a bidirectional long short-term memory sub-network are added on the basis of the triplet-based deep hash network, as shown in the figure below:
As can be seen from the figure, during training each new sample I undergoes feature extraction through the convolutional sub-network, yielding the feature f_l. It is worth noting here that the parameters of this convolutional sub-network are shared with the SHM and are not updated; that is, the trained SHM, besides being used to construct the support memory, also serves as the feature extractor for the new samples.
After feature extraction, a bidirectional long short-term memory network (BLSTM) is designed to carry out the interaction and learning between the new samples and the support memory, as shown in Figure 3.
Specifically, in each iteration of the training stage, M pops up one feature vector for each class in a specified order, denoted f_t, 1≤t≤s; meanwhile, the forward and backward directions of the BLSTM are unrolled into s time steps. Then, as shown in Fig. 3, f_l is used as the non-time-varying (static) input x_static of the BLSTM, and f_t is used as the time-varying input x_t of the BLSTM. The mathematical form is as follows:
x'_t = concat(x_t, x_static)    (2)
where the concat function denotes the concatenation of feature vectors; for example, the two 1024-dimensional vectors f_t and f_l are spliced into the 2048-dimensional vector x'_t. The hidden size of the BLSTM is set to 1024 (consistent with the original feature dimension). After the s time steps of the BLSTM, operating according to formula (3), each forward LSTM cell outputs a 1024-dimensional hf_t and each backward LSTM cell outputs a 1024-dimensional hb_t:
hf_t = LSTM_f(hf_{t-1}, x'_{t-1}), 1 < t ≤ s
hb_t = LSTM_b(hb_{t+1}, x'_{t+1}), 1 ≤ t < s    (3)
The new feature l_new of the new sample can then be expressed as:
h_sum = eltwise_sum(hf_s, hb_1)
l_new = eltwise_product(h_sum, 0.5)    (4)
where eltwise_sum is the element-wise addition of vectors and eltwise_product is the element-wise multiplication. Intuitively, formula (4) simply adds hf_s and hb_1 and takes the average as the new feature representation.
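Formulas (2)-(4) can be sketched with PyTorch's bidirectional nn.LSTM, which computes the forward states hf_t and the backward states hb_t in one pass; the class name is hypothetical, and drawing one popped feature per class from M in each iteration is assumed to be handled by the caller:

import torch
import torch.nn as nn

class MemoryInteraction(nn.Module):
    # Refines a new sample's feature f_l against the support memory via a BLSTM.
    def __init__(self, dim=1024):
        super().__init__()
        # Input is the 2048-d concatenation x'_t = concat(f_t, f_l); the hidden
        # size stays at 1024 to match the original feature dimension.
        self.blstm = nn.LSTM(input_size=2 * dim, hidden_size=dim,
                             bidirectional=True, batch_first=True)

    def forward(self, f_l, memory_slice):
        # memory_slice: (s, dim), one popped feature f_t per support class.
        s, dim = memory_slice.shape
        x_static = f_l.unsqueeze(0).expand(s, dim)       # repeat f_l over all t
        x = torch.cat([memory_slice, x_static], dim=1)   # formula (2)
        out, _ = self.blstm(x.unsqueeze(0))              # (1, s, 2 * dim)
        hf = out[0, :, :dim]                             # forward states hf_t
        hb = out[0, :, dim:]                             # backward states hb_t
        # Formula (4): average the last forward state and the first backward state.
        return 0.5 * (hf[-1] + hb[0])                    # new feature l_new

Under this sketch, the returned l_new would be fed into the same hash code generation sub-network and triplet ranking loss as in the general model.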
So far, after the interactive learning between the BLSTM sub-network and the support memory, each new sample obtains a new 1024-dimensional feature representation.
As shown in Fig. 2, after the new feature representation is obtained, model training can be carried out through the same hash code generation sub-network and triplet ranking loss.
5. Experimental results
1) Datasets
SUN: 64 classes of pictures, 430 samples per class, 27,520 pictures in total. SUN is divided into 2 parts: the first part S contains all samples of 54 classes, 23,220 pictures in total. The remaining 10 classes form the second part L, i.e. the newly learned samples; each class of L has only 3, 5 or 10 training samples (three few-shot settings are tested here, referred to as 3shot, 5shot and 10shot). Apart from the test samples of L, all samples of S and L form the retrieval database.
CIFAR-10: each class has 6,000 samples, for 10 classes and 60,000 pictures in total. The first part S of CIFAR-10 contains the 48,000 samples of the first 8 classes. The remaining 2 classes form L. Likewise, the number of training samples per class of L has 3 settings: 3shot, 5shot and 10shot. Apart from the test samples of L, all samples of S and L form the retrieval database.
CIFAR-100 is similar to CIFAR-10, except that it contains 100 classes of samples with 600 samples per class. The first 80 classes form S and the last 20 classes form L. Likewise, each class of the training set of L has only 3, 5 or 10 samples.
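A sketch of the S/L division under the stated protocol, taking CIFAR-100 as the example (labels 0-79 are assumed to form S and 80-99 to form L; names hypothetical):

def split_few_shot(samples, labels, support_classes=frozenset(range(80)), shots=3):
    # Divide a labeled dataset into S, the few-shot L training set, and L test samples.
    S, L_train, L_test = [], {}, []
    for x, y in zip(samples, labels):
        if y in support_classes:
            S.append((x, y))
        elif len(L_train.setdefault(y, [])) < shots:
            L_train[y].append((x, y))   # keep only `shots` training samples per class
        else:
            L_test.append((x, y))       # remaining L samples are held out for testing
    return S, L_train, L_test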
2) Evaluation metrics
The most common evaluation metrics in the information retrieval field, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gains (NDCG), are selected. Larger MAP and NDCG indicate better retrieval performance.
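For reference, a minimal sketch of the MAP computation under its standard definition (function names hypothetical; NDCG is computed analogously):

import numpy as np

def average_precision(relevant):
    # relevant: 0/1 array over one query's ranked retrieval list.
    relevant = np.asarray(relevant, dtype=float)
    hits = np.cumsum(relevant)
    precisions = hits / (np.arange(len(relevant)) + 1)
    return (precisions * relevant).sum() / max(relevant.sum(), 1)

def mean_average_precision(per_query_relevance):
    # MAP: mean of the per-query average precisions.
    return float(np.mean([average_precision(r) for r in per_query_relevance]))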
3) Comparative experiments
The comparative experiments on the 3 datasets are given below:
Table 1: MAP experimental results on the SUN dataset
Table 2: MAP experimental results on the CIFAR-10 dataset
Table 3: MAP experimental results on the CIFAR-100 dataset
As can be seen from the results, the invention achieves a very large improvement over previous methods. The present invention distills the support memory, i.e. prior knowledge, from abundant samples, and makes reasonable use of the bidirectional long short-term memory sub-network together with the support memory to learn the feature representations of new few-shot samples. The overall network structure of the invention is shown in Fig. 2.
The same or similar reference signs correspond to the same or similar components;
The positional relationships described in the accompanying drawings are for illustration only and should not be construed as limiting this patent;
Obviously, the above embodiments of the present invention are merely examples given to clearly illustrate the present invention and are not intended to limit the embodiments of the present invention. For those of ordinary skill in the art, other variations or changes in different forms can be made on the basis of the above description. It is neither necessary nor possible to exhaust all embodiments here. Any modifications, equivalent replacements and improvements made within the spirit and principle of the present invention shall be included within the protection scope of the claims of the present invention.