Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
The image retrieval method of the present invention, as shown in fig. 1, includes the steps of:
step S101, receiving a query picture and/or a query text submitted by a user;
step S102, extracting multiple content features of the query picture, and segmenting the query text;
step S103, comparing each content feature of the query picture with the corresponding content feature of each picture in a database, and ranking the pictures in the database by similarity to obtain a content similarity list for each feature; comparing the segmented query text with the descriptive document corresponding to each picture in the database, and ranking the pictures in the database by similarity to obtain a text similarity list;
and step S104, assigning a score to each picture in the database according to its position in each list and the weight of that list, re-ranking the pictures according to the assigned scores to obtain a comprehensive similarity ranking list, and returning the list to the user.
In a traditional retrieval method, retrieval is performed either according to text description information submitted by a user or by extracting a single feature from a picture submitted by the user, i.e., single-modality retrieval. With the present retrieval method, the user can retrieve based on the picture alone, based on the text description alone, or based on the picture and the text description jointly. Even when the user submits only a picture, as described in step S102, the present retrieval method extracts not one but multiple content features and performs a comprehensive ranking. In short, compared with the traditional retrieval method, the present retrieval method is a multi-modal hybrid retrieval method. Experiments show that this hybrid retrieval mechanism markedly improves the accuracy of the returned results compared with a conventional single-modality retrieval mechanism. The above steps are described in detail below.
After the user submits the query information, in step S102, content features are extracted from the submitted query picture and the submitted text is segmented into words. In the embodiment of the present invention, the picture content features preferably include color features, texture features and shape features, which are commonly used features that typically reflect the content of a picture. Word segmentation is performed with a Hidden Markov Model (HMM). Let the state set be Q = (q_1, q_2, …, q_N), i.e., the set of position tags used by the corpus (e.g., beginning of word, middle of word, end of word); the observation set be V = (v_1, v_2, …, v_M), i.e., the complete set of characters that the user may input; the observation sequence be O = (o_1, o_2, …, o_T), i.e., the input character sequence to be segmented; and the state sequence be I = (i_1, i_2, …, i_T), i.e., a possible sequence of position tags for that character sequence. First, a corpus is chosen, and then the three parameters of the hidden Markov model are obtained by counting: the state transition probability matrix A = [a_ij]_{N×N}, the observation probability matrix B = [b_j(k)]_{N×M}, and the initial state probability vector π = (π_i). Wherein:
a_ij = count(q_i → q_j) / count(q_i),  b_j(k) = count(q_j emits v_k) / count(q_j),  π_i = P(q_i),
where count(·) denotes a frequency obtained from the training data.
When the HMM model λ = (A, B, π) has been determined, word segmentation is performed with the Viterbi algorithm. Define δ_t(i) as the maximum probability over all single paths (i_1, i_2, …, i_{t−1}, i_t) that are in state i at time t, and let Ψ_t(i) record the state at time t−1 on that maximum-probability path. First, initialize:
δ_1(i) = π_i b_i(o_1),  Ψ_1(i) = 0,  i = 1, 2, …, N.
Then, recursively, for t = 2, …, T:
δ_t(i) = max_{1≤j≤N} [δ_{t−1}(j) a_ji] b_i(o_t),  Ψ_t(i) = arg max_{1≤j≤N} [δ_{t−1}(j) a_ji],  i = 1, 2, …, N.
Finally, let
P* = max_{1≤i≤N} δ_T(i),  i_T* = arg max_{1≤i≤N} δ_T(i),
where P* is the probability of the optimal path and i_T* is the end point of the optimal path. After the end point of the optimal path is found, backtracking is performed: for t = T−1, T−2, …, 1, let
i_t* = Ψ_{t+1}(i_{t+1}*),
giving the optimal path I* = (i_1*, i_2*, …, i_T*). This optimal path is the output hidden state sequence, i.e., the corresponding word segmentation result. The document information of the pictures in the library is segmented in the same way, and the documents are indexed with a classical inverted index to enable efficient retrieval.
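By way of illustration only, the Viterbi decoding described above can be sketched in Python as follows; the array layout (states indexed 0…N−1, observations given as integer indices) and the helper name viterbi are assumptions made for this sketch, not part of the embodiment.

```python
import numpy as np

def viterbi(obs, states, pi, A, B):
    """Return the most probable tag sequence for a list of observation indices."""
    T, N = len(obs), len(states)
    delta = np.zeros((T, N))            # delta[t, i]: best path probability ending in state i at time t
    psi = np.zeros((T, N), dtype=int)   # psi[t, i]: predecessor state on that best path

    delta[0] = pi * B[:, obs[0]]                          # initialization
    for t in range(1, T):                                 # recursion
        for i in range(N):
            trans = delta[t - 1] * A[:, i]
            psi[t, i] = int(np.argmax(trans))
            delta[t, i] = trans[psi[t, i]] * B[i, obs[t]]

    path = [int(np.argmax(delta[T - 1]))]                 # end point of the optimal path
    for t in range(T - 1, 0, -1):                         # backtracking
        path.append(int(psi[t, path[-1]]))
    return [states[i] for i in reversed(path)]
```

With a position tag set such as {B, M, E, S} (word beginning, middle, end, single-character word), consecutive characters up to each E or S tag form one segmented word.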
Next, we extract the content features of the query picture, including color features, texture features and shape features. First we build a color histogram. When a user submits a query picture Q, the submitted picture Q is preprocessed, and a histogram is then counted over the color vector space; the color histogram is a one-dimensional discrete function, namely:
H(k) = n_k / N,  k = 0, 1, …, l − 1,
where n_k is the number of pixels whose quantized color value is k, N is the total number of pixels in the picture, and l is the number of quantized color values, i.e., the dimension of the one-dimensional vector H. This yields the color histogram vector H_Q of the query picture Q. Offline, the color histogram features of the pictures in the library are extracted and indexed in the same way.
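A minimal sketch of this histogram computation, assuming the picture is an H × W × 3 array of 8-bit RGB values; the uniform per-channel quantization used here is only one possible scheme and is not prescribed by the method.

```python
import numpy as np

def color_histogram(img, levels_per_channel=4):
    """Return the normalized color histogram H of an H x W x 3 8-bit RGB image."""
    step = 256 // levels_per_channel                      # uniform per-channel quantization (assumed)
    q = (np.asarray(img) // step).astype(np.int64)
    codes = (q[..., 0] * levels_per_channel + q[..., 1]) * levels_per_channel + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=levels_per_channel ** 3).astype(float)
    return hist / hist.sum()                              # H(k) = n_k / N
```

With the default setting this yields l = 4 × 4 × 4 = 64 quantized colors, i.e., a 64-dimensional vector H_Q.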
Next we extract Scale-Invariant Feature Transform (SIFT) features to describe the texture of the picture. When a user submits a query picture I(x, y), its scale space is:
L(x,y,σ)=G(x,y,σ)*I(x,y)
where G(x, y, σ) is a Gaussian function and σ is the scale parameter. The difference of Gaussians (DoG) of adjacent-scale images is then computed, i.e.:
D(x,y,σ)=L(x,y,kσ)-L(x,y,σ)
where k is typically 2^{1/3}.
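For concreteness, one octave of the Gaussian and difference-of-Gaussians images can be computed as sketched below with OpenCV; the base scale σ₀ = 1.6 and the number of intervals per octave are assumed settings, not values fixed by the embodiment.

```python
import cv2
import numpy as np

def dog_octave(gray, sigma0=1.6, intervals=3):
    """Build one octave of Gaussian images L(x, y, sigma) and their differences D(x, y, sigma)."""
    k = 2.0 ** (1.0 / intervals)                          # ratio between adjacent scales
    sigmas = [sigma0 * k ** i for i in range(intervals + 3)]
    gauss = [cv2.GaussianBlur(gray, (0, 0), s) for s in sigmas]
    dog = [g2.astype(np.float32) - g1.astype(np.float32)  # D = L(k*sigma) - L(sigma)
           for g1, g2 in zip(gauss, gauss[1:])]
    return dog
```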
After the Gaussian differences of adjacent-scale images are computed, a series of images is obtained, and extreme points are sought in this image space: each pixel in every difference-of-Gaussians image is compared with all of its neighboring points in both the image domain and the scale domain to determine whether it is larger or smaller than all of them. After the candidate extreme points are found, the DoG function is fitted by a curve in scale space to screen them, removing points of low contrast and points lying on edges:
D(X) = D + (∂D/∂X)^T X + (1/2) X^T (∂²D/∂X²) X,
X̂ = −(∂²D/∂X²)^{−1} (∂D/∂X),
D(X̂) = D + (1/2) (∂D/∂X)^T X̂,
where X̂ denotes the offset of the sample point and D(X̂) is the extreme value at X̂. For each candidate extreme point, if |D(X̂)| is less than a threshold (generally 0.03), the candidate is judged to be an unstable, low-contrast extreme point and removed.
To obtain stable extreme points, the influence of edges should also be removed: when
Tr(H)² / Det(H) < (γ + 1)² / γ,
the key point is retained, otherwise it is removed; the retained key points are the feature points we are looking for. Here H is the Hessian matrix built from the second derivatives of the DoG image at the candidate point (D_xx being the second derivative in the x direction of an image of a certain scale in DoG space), Tr(H) is the trace of H, Det(H) is the determinant of H, α is the larger eigenvalue of H, β is the smaller eigenvalue, and γ denotes the allowed ratio α/β of the two eigenvalues.
After the positions of the feature points of the picture are determined, a direction is assigned to each feature point by computing the gradients in its neighborhood. The gradient magnitude m(x, y) and gradient direction θ(x, y) are defined as:
m(x, y) = sqrt[ (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² ],
θ(x, y) = arctan[ (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) ].
A region is defined around the feature point, and a direction histogram is formed from the gradients of all points in the region. The bin with the largest value is selected as the principal direction of the feature point. If another bin has a value greater than 80% of the principal direction's value, that direction is also kept as a direction of the feature point.
After feature point detection is finished, a descriptor is determined for each feature point. First, the neighborhood of the feature point is rotated by θ about the feature point (so that the principal direction is adjusted to 0°), where θ is the direction of the feature point. In the rotated image, a 16 × 16 neighborhood window is taken centered on the feature point, each cell representing one pixel of the window. The 16 × 16 window is divided evenly into 16 sub-regions; Gaussian weighting is applied so that pixels close to the feature point receive larger weights and pixels far from it receive smaller weights; then an 8-direction gradient histogram is computed in each sub-region, yielding the descriptor feature vector of the feature point, a 4 × 4 × 8 = 128-dimensional vector. Next, the feature point descriptor is normalized. Let the descriptor be D = (d_1, d_2, …, d_128); after normalization we obtain:
d'_i = d_i / sqrt(d_1² + d_2² + … + d_128²),  i = 1, 2, …, 128.
To reduce the influence of large gradient values, a threshold of 0.2 is applied: if the value of some dimension of the vector is greater than 0.2, it is set to 0.2, and the normalization is performed again.
Offline, all feature point descriptors of the pictures in the library are obtained by the steps above, and the descriptors are clustered. The resulting clusters are used as visual words, and a Bag of Words model is applied to build an inverted index over the pictures in the library. The same bag-of-words model is then applied to obtain the feature vector representation of the query picture.
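A minimal offline sketch of the visual-word construction, assuming the SIFT descriptors of the library pictures have already been stacked into one matrix; k-means is used here as one common clustering choice, and the vocabulary size of 1000 is an assumed parameter.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(all_descriptors, n_words=1000):
    """Cluster the stacked library descriptors into n_words visual words."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(all_descriptors)

def bow_vector(vocab, descriptors):
    """Represent one picture as a normalized histogram over the visual words."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)
```

The per-picture histograms produced by bow_vector are what the inverted index and the query-time comparison would operate on.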
Finally we extract and index the global shape features of the picture. After a user submits a query picture, the query picture is first sampled and filtered with a bank of Gabor filters, each of which is a Gaussian envelope of standard deviation σ modulated by a sinusoid, where l is the size (scale) of the filter, k is a positive constant, and the orientations are θ_i = π(i − 1)/θ_l, i = 1, 2, …, θ_l, with θ_l the total number of orientations at scale l. The picture is convolved with each Gabor filter to obtain the filtered images. Each filtered picture is divided into a 4 × 4 grid, the mean value is taken within each cell, and the mean values obtained over all orientations and scales are finally collected into one vector as the shape feature of the query picture. In the offline indexing step, the same computation is performed on the pictures in the library to obtain a shape feature index (a k-d tree index or a hash index is built for efficient retrieval) for matching picture shape features.
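By way of illustration, the filtering and 4 × 4 grid averaging can be sketched with OpenCV's built-in Gabor kernel; the kernel size, wavelength, σ and the single-scale filter bank used here are assumptions for this sketch and may differ from the exact kernel of the embodiment.

```python
import cv2
import numpy as np

def gabor_shape_feature(gray, n_orient=8, ksize=31, sigma=4.0, lambd=10.0):
    """Filter with Gabor kernels at several orientations and average over a 4 x 4 grid."""
    gray = gray.astype(np.float32)
    feats = []
    for i in range(n_orient):
        theta = np.pi * i / n_orient
        kern = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5)  # last arg: aspect ratio
        resp = cv2.filter2D(gray, -1, kern)
        h, w = resp.shape
        for r in range(4):                    # mean response inside each cell of a 4 x 4 grid
            for c in range(4):
                cell = resp[r * h // 4:(r + 1) * h // 4, c * w // 4:(c + 1) * w // 4]
                feats.append(float(cell.mean()))
    return np.array(feats)
```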
After the segmented text information and the content features of the query picture have been obtained, in step S103 related pictures are searched for in the joint text and image indexes according to this information.
Fig. 2 is a schematic flow chart of the search method and shows a specific implementation of step S103: an information retrieval (IR) model is built for each single-modality (single-item) feature separately (step S201). Each IR model runs independently and can be configured freely, so different IR models can be selected and combined according to the actual situation, and each finally returns a result list according to its own ranking algorithm. Then, in step S202, the segmented text information and/or the content features extracted from the query picture are fed into the corresponding IR models to obtain several ordered lists, and in step S203 the ordered lists are fused to finally obtain a comprehensive ordered list that is returned to the user. This embodiment of the joint search method combines the results output by the text-based retrieval model with the results output by the image-content-based retrieval models to obtain a comprehensive ordered list, which contains the returned picture results arranged in descending order of relevance to the query information.
Specifically, in step S201, a text-based IR model is first established. Preferably, the text IR model is created using statistical language modeling. Let V denote the vocabulary of a language, V = {ω_1, ω_2, …, ω_|V|}, where each ω_i is called a term; let D be a document in the document collection C, D = d_1 d_2 … d_n, d_i ∈ V. In the statistical translation model, when the user submits text query information Q = q_1 q_2 … q_m, q_i ∈ V, the probability that document D is "translated" into the query information Q is:
P(Q | D) = ∏_{i=1}^{m} Σ_{ω∈V} t(q_i | ω) P(ω | D),
where P(ω | D) is the basic document language model and t(q_i | ω) is the translation probability. After P(Q | D) is computed, a ranking of the documents in the collection must be returned. For this we need to estimate the posterior probability P(D | Q); by the Bayes formula:
P(D | Q) = P(Q | D) P(D) / P(Q) ∝ P(Q | D) P(D),
where P(D) may be taken as some query-independent measure; this term is not considered in the present model. After the posterior probability P(D | Q) is computed, the documents in the collection can be ranked by the probability values and a ranked list returned, as in step S202.
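A sketch of this ranking computation, assuming the document language models P(ω | D) are given as dictionaries mapping terms to probabilities and the translation probabilities t(q | ω) as a dictionary keyed by (query term, document term); these data structures are illustrative assumptions.

```python
def score_document(query_terms, doc_lm, t_prob):
    """P(Q|D) = product over query terms of sum_w t(q|w) * P(w|D)."""
    score = 1.0
    for q in query_terms:
        score *= sum(t_prob.get((q, w), 0.0) * p for w, p in doc_lm.items())
    return score

def rank_documents(query_terms, doc_lms, t_prob):
    """Rank documents by P(D|Q), proportional to P(Q|D) when the prior P(D) is ignored."""
    scores = {doc_id: score_document(query_terms, lm, t_prob) for doc_id, lm in doc_lms.items()}
    return sorted(scores, key=scores.get, reverse=True)
```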
When the user submits a query picture, in step S201 several IR models based on image content features are established, one IR model for each image feature (color feature, texture feature, shape feature, and so on). As mentioned above, these image content features are all represented in a vector space, so the similarity between feature vectors must be measured. The similarity is preferably computed with the Euclidean distance: for two n-dimensional vectors (x_11, x_12, …, x_1n) and (x_21, x_22, …, x_2n), the Euclidean distance between them is
d = sqrt( Σ_{k=1}^{n} (x_1k − x_2k)² ).
A larger distance indicates lower relevance. After the similarity computation is finished, a list sorted in descending order of relevance is returned, as in step S202.
It should be noted that one list is returned for each feature extracted from the picture. For example, if the color and texture features of a picture are extracted separately, two sorted lists are returned, each sorted in descending order of relevance to the query picture on the corresponding visual feature.
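A minimal sketch of one such content-feature IR model, assuming the corresponding feature vectors of the library pictures are stacked row-wise into a matrix.

```python
import numpy as np

def rank_by_feature(query_vec, library_feats):
    """Return library picture indices sorted from most to least similar (smallest distance first)."""
    dists = np.linalg.norm(library_feats - query_vec, axis=1)   # Euclidean distances
    return np.argsort(dists)
```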
After the several picture relevance ranking lists are obtained, each picture d_j on the lists is assigned a score S_HLFIRM that combines, over all lists L_i, the weight α_i of the i-th IR model and the position ψ(d_j, L_i) of the picture in that list. Here ψ(x, H) denotes the position of picture x in list H, and 1_a is an indicator function that takes the value 1 when a is true, i.e., when d_j belongs to list L_i, and 0 otherwise; α_i is the weight of the i-th IR model, with Σ_i α_i = 1. Intuitively, pictures that appear at earlier positions in the lists obtain higher scores, and pictures with higher scores are more relevant to the query information. The score is computed and the pictures are sorted by score, as in steps S203 and S204.
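One plausible instantiation of this score is sketched below; the reciprocal-rank form of the position term and the example weights are assumptions made for illustration only, since the embodiment merely requires that earlier positions in a list contribute more to the score.

```python
def fuse_rankings(ranked_lists, weights):
    """Combine per-model ranked lists of picture ids into one comprehensive ranking.

    ranked_lists: one list of picture ids per IR model, best match first.
    weights:      the alpha_i of the IR models, summing to 1.
    """
    scores = {}
    for alpha, ranking in zip(weights, ranked_lists):
        for pos, pic in enumerate(ranking, start=1):            # pos = psi(pic, L_i)
            scores[pic] = scores.get(pic, 0.0) + alpha / pos    # assumed reciprocal-rank weighting
    return sorted(scores, key=scores.get, reverse=True)
```

Calling, for instance, fuse_rankings([list_color, list_texture, list_text], [0.3, 0.3, 0.4]) with three hypothetical per-model lists would return the comprehensive ordered list of steps S203 and S204.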
In the above scoring, if one IR model performs better than the others, a higher weight should be assigned to that model to improve the performance of the whole system. In the embodiment of the invention, the weights α_i are set with an automatic optimization method; alternative automatic optimization methods include genetic algorithms, simulated annealing, and the like. For example, in a genetic algorithm, the weights of each model are first initialized, and an initial population of weight vectors is generated by multiple random initializations. Next, the fitness of each individual in the population is measured. In image retrieval, the fitness of a weight vector is measured by the performance of the search results it produces: several test queries and their corresponding sets of relevant pictures are given, and a program computes the performance of the search results against these relevant picture sets and outputs it as the fitness indicator. Many indexes can measure search result performance, such as the F1 score, Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP). Then individuals are selected, crossed over and mutated according to their fitness to generate a new generation of the population, which tends to be superior to the previous one. Repeating this process continuously improves the fitness of the weight vectors until a termination condition is met: either an individual reaching a preset fitness target for the weight vector is obtained within the maximum number of iterations, or the maximum number of iterations is reached, in which case the individual with the highest fitness among all generated individuals is returned.
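A compact sketch of such a genetic optimization of the weight vector; the population size, mutation scale, crossover rule and the externally supplied fitness callable (e.g., MAP or NDCG computed over the test queries) are all placeholder choices for this sketch.

```python
import numpy as np

def optimize_weights(fitness, n_models, pop_size=20, generations=50, seed=0):
    """Evolve a population of weight vectors (non-negative, summing to 1) to maximize fitness."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_models))
    pop /= pop.sum(axis=1, keepdims=True)
    for _ in range(generations):
        fit = np.array([fitness(w) for w in pop])
        parents = pop[np.argsort(fit)[-pop_size // 2:]]              # selection: keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = (a + b) / 2.0                                    # crossover: average the parents
            child = np.abs(child + rng.normal(0.0, 0.05, n_models))  # mutation
            children.append(child / child.sum())                     # renormalize the weights
        pop = np.vstack([parents] + children)
    fit = np.array([fitness(w) for w in pop])
    return pop[int(np.argmax(fit))]
```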
The embodiment of the present invention may further include step S105: if the user is satisfied with the output result, the search process is ended; if the user is not satisfied with the output result or deviates from his/her previous idea, the query text and/or query picture can be supplemented or modified based on the previous query information. The method performs word segmentation and feature extraction on the modified text and/or picture, and repeats steps S102 to S105 until a result satisfied by the user is output.
Fig. 3 compares the results of the retrieval mechanism of the present invention with those of a single-modality image retrieval mechanism. The upper portion of the right half of the figure shows the results of pure text retrieval, the middle portion shows the results of pure image content retrieval, and the lower portion shows the results returned by the hybrid retrieval mechanism provided by the invention. The returned picture results show that the performance of the method is greatly improved compared with the traditional single-modality retrieval methods and better satisfies the user's needs.
The image search system according to the present invention is a system corresponding to the above method, and as shown in fig. 4, includes:
the query information receiving terminal is used for receiving a query picture and/or a query text submitted by a user;
the query information processing module is used for extracting various content characteristics of the query picture and segmenting the query text;
the similarity single-item ranking module is used for comparing each content feature of the query picture with the corresponding content feature of each picture in the database, and ranking the pictures in the database by similarity to obtain a content similarity list for each feature; and for comparing the segmented query text with the descriptive document corresponding to each picture in the database, and ranking the pictures in the database by similarity to obtain a text similarity list;
and the comprehensive similarity sorting module is used for assigning scores to each picture in the database according to the positions in the lists and the weight of the list where the picture is located, re-sorting the pictures according to the assigned scores to obtain a comprehensive similarity sorting list, and returning the list to the user.
As a preferred embodiment, the content features of the picture include color features, texture features, and shape features.
As a preferred embodiment, the query information processing module performs word segmentation on the query text by using a hidden markov model.
As a preferred embodiment, the similarity single item ranking module measures the similarity between the query text after word segmentation and the descriptive document corresponding to each picture in the database by adopting a statistical language modeling method; and calculating the similarity between various content characteristics of the query picture and corresponding content characteristics of each picture in the database by adopting the Euclidean distance.
As a preferred embodiment, the comprehensive similarity ranking module uses a genetic algorithm or a simulated annealing algorithm to set the weight of each list.
In conclusion, the beneficial effects of the invention are as follows:
the invention improves the practicability of image search: first, most of the conventional search methods are based on monomodal search, and the way of expressing the query intention of the user is limited to a certain extent. Second, content-based image retrieval faces the semantic gap problem. The invention changes the status by combining the multi-modal information of text and image content.
The invention improves the flexibility of image search: most conventional image search methods use several fixed features for searching, whereas in the invention the feature combination can be configured flexibly and freely.
The invention improves the accuracy of image search: by combining the results returned by the text-based and the content-based image retrieval methods, a more accurate picture relevance ranking list is obtained, and the accuracy of the returned results is greatly improved.
The above-mentioned embodiments merely express several embodiments of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.