Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
The image retrieval method of the present invention, as shown in fig. 1, includes the steps of:
step S101, receiving a query picture and/or a query text submitted by a user;
step S102, extracting multiple content features of the query picture, and segmenting the query text;
step S103, comparing each content feature of the query picture with the corresponding content feature of each picture in a database, and ranking the pictures in the database by similarity to obtain a content similarity list for each feature; comparing the segmented query text with the descriptive document corresponding to each picture in the database, and ranking the pictures in the database by similarity to obtain a text similarity list;
and step S104, assigning a score to each picture in the database according to its position in each list and the weight of that list, re-ranking the pictures according to the assigned scores to obtain a comprehensive similarity ranking list, and returning the list to the user.
In a traditional retrieval method, retrieval is performed either according to text description information submitted by a user or by extracting a single feature from a picture submitted by the user, i.e., single-modality retrieval. With the present retrieval method, the user can retrieve based on the picture alone, based on the text description alone, or based on the picture and the text description jointly. Even when the user submits only a picture, as described in step S102, the present retrieval method extracts not one but multiple content features and performs a comprehensive ranking. In short, compared with the traditional retrieval method, the present retrieval method is a multi-modal hybrid retrieval method. Experiments show that this hybrid retrieval mechanism markedly improves the accuracy of the returned results compared with a conventional single-modality retrieval mechanism. The above steps are described in detail below.
After the user submits the query information, in step S102, content features are extracted from the submitted query picture and the submitted text is segmented into words. In the embodiment of the present invention, the picture content features preferably include color features, texture features and shape features, which are commonly used features that typically reflect the content of a picture. Word segmentation is performed with a Hidden Markov Model (HMM). Let the state set be Q = (q_1, q_2, …, q_N), i.e., the set of position tags used by the corpus (e.g., beginning of word, middle of word, end of word); the observation set be V = (v_1, v_2, …, v_M), i.e., the complete set of characters that the user may input; the observation sequence be O = (o_1, o_2, …, o_T), i.e., the input character sequence to be segmented; and the state sequence be I = (i_1, i_2, …, i_T), i.e., a possible sequence of position tags for that character sequence. First, a corpus is chosen, and then the three parameters of the hidden Markov model are obtained by counting: the state transition probability matrix A = [a_ij]_{N×N}, the observation probability matrix B = [b_j(k)]_{N×M}, and the initial state probability vector π = (π_i). Wherein:
a_ij = count(q_i → q_j) / count(q_i),  b_j(k) = count(q_j emits v_k) / count(q_j),  π_i = P(q_i),
where count(·) denotes a frequency obtained from the training data.
When the HMM model λ = (A, B, π) has been determined, word segmentation is performed with the Viterbi algorithm. Define δ_t(i) as the maximum probability over all single paths (i_1, i_2, …, i_{t−1}, i_t) that are in state i at time t, and let Ψ_t(i) record the state at time t−1 on that maximum-probability path. First, initialize:
δ_1(i) = π_i b_i(o_1),  Ψ_1(i) = 0,  i = 1, 2, …, N.
Then, recursively, for t = 2, …, T:
δ_t(i) = max_{1≤j≤N} [δ_{t−1}(j) a_ji] b_i(o_t),  Ψ_t(i) = arg max_{1≤j≤N} [δ_{t−1}(j) a_ji],  i = 1, 2, …, N.
Finally, let
P* = max_{1≤i≤N} δ_T(i),  i_T* = arg max_{1≤i≤N} δ_T(i),
where P* is the probability of the optimal path and i_T* is the end point of the optimal path. After the end point of the optimal path is found, backtracking is performed: for t = T−1, T−2, …, 1, let
i_t* = Ψ_{t+1}(i_{t+1}*),
giving the optimal path I* = (i_1*, i_2*, …, i_T*). This optimal path is the output hidden state sequence, i.e., the corresponding word segmentation result. The document information of the pictures in the library is segmented in the same way, and the documents are indexed with a classical inverted index to enable efficient retrieval.
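By way of illustration only, the Viterbi decoding described above can be sketched in Python as follows; the array layout (states indexed 0…N−1, observations given as integer indices) and the helper name viterbi are assumptions made for this sketch, not part of the embodiment.

```python
import numpy as np

def viterbi(obs, states, pi, A, B):
    """Return the most probable tag sequence for a list of observation indices."""
    T, N = len(obs), len(states)
    delta = np.zeros((T, N))            # delta[t, i]: best path probability ending in state i at time t
    psi = np.zeros((T, N), dtype=int)   # psi[t, i]: predecessor state on that best path

    delta[0] = pi * B[:, obs[0]]                          # initialization
    for t in range(1, T):                                 # recursion
        for i in range(N):
            trans = delta[t - 1] * A[:, i]
            psi[t, i] = int(np.argmax(trans))
            delta[t, i] = trans[psi[t, i]] * B[i, obs[t]]

    path = [int(np.argmax(delta[T - 1]))]                 # end point of the optimal path
    for t in range(T - 1, 0, -1):                         # backtracking
        path.append(int(psi[t, path[-1]]))
    return [states[i] for i in reversed(path)]
```

With a position tag set such as {B, M, E, S} (word beginning, middle, end, single-character word), consecutive characters up to each E or S tag form one segmented word.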
Next, we extract the content features of the query picture, including color features, texture features and shape features. First we build a color histogram. When a user submits a query picture Q, the submitted picture Q is preprocessed, and a histogram is then counted over the color vector space; the color histogram is a one-dimensional discrete function, namely:
H(k) = n_k / N,  k = 0, 1, …, l − 1,
where n_k is the number of pixels whose quantized color value is k, N is the total number of pixels in the picture, and l is the number of quantized color values, i.e., the dimension of the one-dimensional vector H. This yields the color histogram vector H_Q of the query picture Q. Offline, the color histogram features of the pictures in the library are extracted and indexed in the same way.
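A minimal sketch of this histogram computation, assuming the picture is an H × W × 3 array of 8-bit RGB values; the uniform per-channel quantization used here is only one possible scheme and is not prescribed by the method.

```python
import numpy as np

def color_histogram(img, levels_per_channel=4):
    """Return the normalized color histogram H of an H x W x 3 8-bit RGB image."""
    step = 256 // levels_per_channel                      # uniform per-channel quantization (assumed)
    q = (np.asarray(img) // step).astype(np.int64)
    codes = (q[..., 0] * levels_per_channel + q[..., 1]) * levels_per_channel + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=levels_per_channel ** 3).astype(float)
    return hist / hist.sum()                              # H(k) = n_k / N
```

With the default setting this yields l = 4 × 4 × 4 = 64 quantized colors, i.e., a 64-dimensional vector H_Q.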
Next we extract Scale-Invariant Feature Transform (SIFT) features to describe the texture of the picture. When a user submits a query picture I(x, y), its scale space is:
L(x,y,σ)=G(x,y,σ)*I(x,y)
where G(x, y, σ) is a Gaussian function and σ is the scale parameter. The difference of Gaussians (DoG) of adjacent-scale images is then computed, i.e.:
D(x,y,σ)=L(x,y,kσ)-L(x,y,σ)
where k is typically 2^{1/3}.
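For concreteness, one octave of the Gaussian and difference-of-Gaussians images can be computed as sketched below with OpenCV; the base scale σ₀ = 1.6 and the number of intervals per octave are assumed settings, not values fixed by the embodiment.

```python
import cv2
import numpy as np

def dog_octave(gray, sigma0=1.6, intervals=3):
    """Build one octave of Gaussian images L(x, y, sigma) and their differences D(x, y, sigma)."""
    k = 2.0 ** (1.0 / intervals)                          # ratio between adjacent scales
    sigmas = [sigma0 * k ** i for i in range(intervals + 3)]
    gauss = [cv2.GaussianBlur(gray, (0, 0), s) for s in sigmas]
    dog = [g2.astype(np.float32) - g1.astype(np.float32)  # D = L(k*sigma) - L(sigma)
           for g1, g2 in zip(gauss, gauss[1:])]
    return dog
```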
After the Gaussian differences of adjacent-scale images are computed, a series of images is obtained, and extreme points are sought in this image space: each pixel in every difference-of-Gaussians image is compared with all of its neighboring points in both the image domain and the scale domain to determine whether it is larger or smaller than all of them. After the candidate extreme points are found, the DoG function is fitted by a curve in scale space to screen them, removing points of low contrast and points lying on edges:
D(X) = D + (∂D/∂X)^T X + (1/2) X^T (∂²D/∂X²) X,
X̂ = −(∂²D/∂X²)^{−1} (∂D/∂X),
D(X̂) = D + (1/2) (∂D/∂X)^T X̂,
where X̂ denotes the offset of the sample point and D(X̂) is the extreme value at X̂. For each candidate extreme point, if |D(X̂)| is less than a threshold (generally 0.03), the candidate is judged to be an unstable, low-contrast extreme point and removed.
To obtain stable extreme points, the influence of edges should also be removed: when
Tr(H)² / Det(H) < (γ + 1)² / γ,
the key point is retained, otherwise it is removed; the retained key points are the feature points we are looking for. Here H is the Hessian matrix built from the second derivatives of the DoG image at the candidate point (D_xx being the second derivative in the x direction of an image of a certain scale in DoG space), Tr(H) is the trace of H, Det(H) is the determinant of H, α is the larger eigenvalue of H, β is the smaller eigenvalue, and γ denotes the allowed ratio α/β of the two eigenvalues.
After the positions of the feature points of the picture are determined, a direction is assigned to each feature point by computing the gradients in its neighborhood. The gradient magnitude m(x, y) and gradient direction θ(x, y) are defined as:
m(x, y) = sqrt[ (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² ],
θ(x, y) = arctan[ (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) ].
A region is defined around the feature point, and a direction histogram is formed from the gradients of all points in the region. The bin with the largest value is selected as the principal direction of the feature point. If another bin has a value greater than 80% of the principal direction's value, that direction is also kept as a direction of the feature point.
After feature point detection is finished, a descriptor is determined for each feature point. First, the neighborhood of the feature point is rotated by θ about the feature point (so that the principal direction is adjusted to 0°), where θ is the direction of the feature point. In the rotated image, a 16 × 16 neighborhood window is taken centered on the feature point, each cell representing one pixel of the window. The 16 × 16 window is divided evenly into 16 sub-regions; Gaussian weighting is applied so that pixels close to the feature point receive larger weights and pixels far from it receive smaller weights; then an 8-direction gradient histogram is computed in each sub-region, yielding the descriptor feature vector of the feature point, a 4 × 4 × 8 = 128-dimensional vector. Next, the feature point descriptor is normalized. Let the descriptor be D = (d_1, d_2, …, d_128); after normalization we obtain:
d'_i = d_i / sqrt(d_1² + d_2² + … + d_128²),  i = 1, 2, …, 128.
To reduce the influence of large gradient values, a threshold of 0.2 is applied: if the value of some dimension of the vector is greater than 0.2, it is set to 0.2, and the normalization is performed again.
Offline, all feature point descriptors of the pictures in the library are obtained by the steps above, and the descriptors are clustered. The resulting clusters are used as visual words, and a Bag of Words model is applied to build an inverted index over the pictures in the library. The same bag-of-words model is then applied to obtain the feature vector representation of the query picture.
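A minimal offline sketch of the visual-word construction, assuming the SIFT descriptors of the library pictures have already been stacked into one matrix; k-means is used here as one common clustering choice, and the vocabulary size of 1000 is an assumed parameter.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(all_descriptors, n_words=1000):
    """Cluster the stacked library descriptors into n_words visual words."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(all_descriptors)

def bow_vector(vocab, descriptors):
    """Represent one picture as a normalized histogram over the visual words."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)
```

The per-picture histograms produced by bow_vector are what the inverted index and the query-time comparison would operate on.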
Finally we extract and index the global shape features of the picture. After a user submits a query picture, the query picture is first sampled and filtered with a bank of Gabor filters, each of which is a Gaussian envelope of standard deviation σ modulated by a sinusoid, where l is the size (scale) of the filter, k is a positive constant, and the orientations are θ_i = π(i − 1)/θ_l, i = 1, 2, …, θ_l, with θ_l the total number of orientations at scale l. The picture is convolved with each Gabor filter to obtain the filtered images. Each filtered picture is divided into a 4 × 4 grid, the mean value is taken within each cell, and the mean values obtained over all orientations and scales are finally collected into one vector as the shape feature of the query picture. In the offline indexing step, the same computation is performed on the pictures in the library to obtain a shape feature index (a k-d tree index or a hash index is built for efficient retrieval) for matching picture shape features.
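By way of illustration, the filtering and 4 × 4 grid averaging can be sketched with OpenCV's built-in Gabor kernel; the kernel size, wavelength, σ and the single-scale filter bank used here are assumptions for this sketch and may differ from the exact kernel of the embodiment.

```python
import cv2
import numpy as np

def gabor_shape_feature(gray, n_orient=8, ksize=31, sigma=4.0, lambd=10.0):
    """Filter with Gabor kernels at several orientations and average over a 4 x 4 grid."""
    gray = gray.astype(np.float32)
    feats = []
    for i in range(n_orient):
        theta = np.pi * i / n_orient
        kern = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, 0.5)  # last arg: aspect ratio
        resp = cv2.filter2D(gray, -1, kern)
        h, w = resp.shape
        for r in range(4):                    # mean response inside each cell of a 4 x 4 grid
            for c in range(4):
                cell = resp[r * h // 4:(r + 1) * h // 4, c * w // 4:(c + 1) * w // 4]
                feats.append(float(cell.mean()))
    return np.array(feats)
```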
After the segmented text information and the content features of the query picture have been obtained, in step S103 related pictures are searched for in the joint text and image indexes according to this information.
Fig. 2 is a schematic flow chart of the search method and shows a specific implementation of step S103: an information retrieval (IR) model is built for each single-modality (single-item) feature separately (step S201). Each IR model runs independently and can be configured freely, so different IR models can be selected and combined according to the actual situation, and each finally returns a result list according to its own ranking algorithm. Then, in step S202, the segmented text information and/or the content features extracted from the query picture are fed into the corresponding IR models to obtain several ordered lists, and in step S203 the ordered lists are fused to finally obtain a comprehensive ordered list that is returned to the user. This embodiment of the joint search method combines the results output by the text-based retrieval model with the results output by the image-content-based retrieval models to obtain a comprehensive ordered list, which contains the returned picture results arranged in descending order of relevance to the query information.
Specifically, in step S201, a text-based IR model is first established. Preferably, the text IR model is created using statistical language modeling. Let V denote the vocabulary of a language, V = {ω_1, ω_2, …, ω_|V|}, where each ω_i is called a term; let D be a document in the document collection C, D = d_1 d_2 … d_n, d_i ∈ V. In the statistical translation model, when the user submits text query information Q = q_1 q_2 … q_m, q_i ∈ V, the probability that document D is "translated" into the query information Q is:
P(Q | D) = ∏_{i=1}^{m} Σ_{ω∈V} t(q_i | ω) P(ω | D),
where P(ω | D) is the basic document language model and t(q_i | ω) is the translation probability. After P(Q | D) is computed, a ranking of the documents in the collection must be returned. For this we need to estimate the posterior probability P(D | Q); by the Bayes formula:
P(D | Q) = P(Q | D) P(D) / P(Q) ∝ P(Q | D) P(D),
where P(D) may be taken as some query-independent measure; this term is not considered in the present model. After the posterior probability P(D | Q) is computed, the documents in the collection can be ranked by the probability values and a ranked list returned, as in step S202.
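A sketch of this ranking computation, assuming the document language models P(ω | D) are given as dictionaries mapping terms to probabilities and the translation probabilities t(q | ω) as a dictionary keyed by (query term, document term); these data structures are illustrative assumptions.

```python
def score_document(query_terms, doc_lm, t_prob):
    """P(Q|D) = product over query terms of sum_w t(q|w) * P(w|D)."""
    score = 1.0
    for q in query_terms:
        score *= sum(t_prob.get((q, w), 0.0) * p for w, p in doc_lm.items())
    return score

def rank_documents(query_terms, doc_lms, t_prob):
    """Rank documents by P(D|Q), proportional to P(Q|D) when the prior P(D) is ignored."""
    scores = {doc_id: score_document(query_terms, lm, t_prob) for doc_id, lm in doc_lms.items()}
    return sorted(scores, key=scores.get, reverse=True)
```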
When the user submits a query picture, in step S201 several IR models based on image content features are established, one IR model for each image feature (color feature, texture feature, shape feature, and so on). As mentioned above, these image content features are all represented in a vector space, so the similarity between feature vectors must be measured. The similarity is preferably computed with the Euclidean distance: for two n-dimensional vectors (x_11, x_12, …, x_1n) and (x_21, x_22, …, x_2n), the Euclidean distance between them is
d = sqrt( Σ_{k=1}^{n} (x_1k − x_2k)² ).
A larger distance indicates lower relevance. After the similarity computation is finished, a list sorted in descending order of relevance is returned, as in step S202.
It should be noted that one list is returned for each feature extracted from the picture. For example, if the color and texture features of a picture are extracted separately, two sorted lists are returned, each sorted in descending order of relevance to the query picture on the corresponding visual feature.
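A minimal sketch of one such content-feature IR model, assuming the corresponding feature vectors of the library pictures are stacked row-wise into a matrix.

```python
import numpy as np

def rank_by_feature(query_vec, library_feats):
    """Return library picture indices sorted from most to least similar (smallest distance first)."""
    dists = np.linalg.norm(library_feats - query_vec, axis=1)   # Euclidean distances
    return np.argsort(dists)
```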
After the several picture relevance ranking lists are obtained, each picture d_j on the lists is assigned a score S_HLFIRM that combines, over all lists L_i, the weight α_i of the i-th IR model and the position ψ(d_j, L_i) of the picture in that list. Here ψ(x, H) denotes the position of picture x in list H, and 1_a is an indicator function that takes the value 1 when a is true, i.e., when d_j belongs to list L_i, and 0 otherwise; α_i is the weight of the i-th IR model, with Σ_i α_i = 1. Intuitively, pictures that appear at earlier positions in the lists obtain higher scores, and pictures with higher scores are more relevant to the query information. The score is computed and the pictures are sorted by score, as in steps S203 and S204.
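One plausible instantiation of this score is sketched below; the reciprocal-rank form of the position term and the example weights are assumptions made for illustration only, since the embodiment merely requires that earlier positions in a list contribute more to the score.

```python
def fuse_rankings(ranked_lists, weights):
    """Combine per-model ranked lists of picture ids into one comprehensive ranking.

    ranked_lists: one list of picture ids per IR model, best match first.
    weights:      the alpha_i of the IR models, summing to 1.
    """
    scores = {}
    for alpha, ranking in zip(weights, ranked_lists):
        for pos, pic in enumerate(ranking, start=1):            # pos = psi(pic, L_i)
            scores[pic] = scores.get(pic, 0.0) + alpha / pos    # assumed reciprocal-rank weighting
    return sorted(scores, key=scores.get, reverse=True)
```

Calling, for instance, fuse_rankings([list_color, list_texture, list_text], [0.3, 0.3, 0.4]) with three hypothetical per-model lists would return the comprehensive ordered list of steps S203 and S204.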
In the above scoring, if one IR model performs better than the others, a higher weight should be assigned to that model to improve the performance of the whole system. In the embodiment of the invention, the weights α_i are set with an automatic optimization method; alternative automatic optimization methods include genetic algorithms, simulated annealing, and the like. For example, in a genetic algorithm, the weights of each model are first initialized, and an initial population of weight vectors is generated by multiple random initializations. Next, the fitness of each individual in the population is measured. In image retrieval, the fitness of a weight vector is measured by the performance of the search results it produces: several test queries and their corresponding sets of relevant pictures are given, and a program computes the performance of the search results against these relevant picture sets and outputs it as the fitness indicator. Many indexes can measure search result performance, such as the F1 score, Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP). Then individuals are selected, crossed over and mutated according to their fitness to generate a new generation of the population, which tends to be superior to the previous one. Repeating this process continuously improves the fitness of the weight vectors until a termination condition is met: either an individual reaching a preset fitness target for the weight vector is obtained within the maximum number of iterations, or the maximum number of iterations is reached, in which case the individual with the highest fitness among all generated individuals is returned.
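A compact sketch of such a genetic optimization of the weight vector; the population size, mutation scale, crossover rule and the externally supplied fitness callable (e.g., MAP or NDCG computed over the test queries) are all placeholder choices for this sketch.

```python
import numpy as np

def optimize_weights(fitness, n_models, pop_size=20, generations=50, seed=0):
    """Evolve a population of weight vectors (non-negative, summing to 1) to maximize fitness."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_models))
    pop /= pop.sum(axis=1, keepdims=True)
    for _ in range(generations):
        fit = np.array([fitness(w) for w in pop])
        parents = pop[np.argsort(fit)[-pop_size // 2:]]              # selection: keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = (a + b) / 2.0                                    # crossover: average the parents
            child = np.abs(child + rng.normal(0.0, 0.05, n_models))  # mutation
            children.append(child / child.sum())                     # renormalize the weights
        pop = np.vstack([parents] + children)
    fit = np.array([fitness(w) for w in pop])
    return pop[int(np.argmax(fit))]
```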
The embodiment of the present invention may further include step S105: if the user is satisfied with the output result, the search process is ended; if the user is not satisfied with the output result or deviates from his/her previous idea, the query text and/or query picture can be supplemented or modified based on the previous query information. The method performs word segmentation and feature extraction on the modified text and/or picture, and repeats steps S102 to S105 until a result satisfied by the user is output.
Fig. 3 compares the results of the retrieval mechanism of the present invention with those of a single-modality image retrieval mechanism. The upper portion of the right half of the figure shows the results of pure text retrieval, the middle portion shows the results of pure image content retrieval, and the lower portion shows the results returned by the hybrid retrieval mechanism provided by the invention. The returned picture results show that the performance of the method is greatly improved compared with the traditional single-modality retrieval methods and better satisfies the user's needs.
The image search system according to the present invention is a system corresponding to the above method, and as shown in fig. 4, includes:
the query information receiving terminal is used for receiving a query picture and/or a query text submitted by a user;
the query information processing module is used for extracting various content characteristics of the query picture and segmenting the query text;
the similarity single-item ranking module is used for comparing each content feature of the query picture with the corresponding content feature of each picture in the database, and ranking the pictures in the database by similarity to obtain a content similarity list for each feature; and for comparing the segmented query text with the descriptive document corresponding to each picture in the database, and ranking the pictures in the database by similarity to obtain a text similarity list;
and the comprehensive similarity sorting module is used for assigning scores to each picture in the database according to the positions in the lists and the weight of the list where the picture is located, re-sorting the pictures according to the assigned scores to obtain a comprehensive similarity sorting list, and returning the list to the user.
As a preferred embodiment, the content features of the picture include color features, texture features, and shape features.
As a preferred embodiment, the query information processing module performs word segmentation on the query text by using a hidden markov model.
As a preferred embodiment, the similarity single item ranking module measures the similarity between the query text after word segmentation and the descriptive document corresponding to each picture in the database by adopting a statistical language modeling method; and calculating the similarity between various content characteristics of the query picture and corresponding content characteristics of each picture in the database by adopting the Euclidean distance.
As a preferred embodiment, the comprehensive similarity ranking module uses a genetic algorithm or a simulated annealing algorithm to set the weight of each list.
In conclusion, the beneficial effects of the invention are as follows:
the invention improves the practicability of image search: first, most of the conventional search methods are based on monomodal search, and the way of expressing the query intention of the user is limited to a certain extent. Second, content-based image retrieval faces the semantic gap problem. The invention changes the status by combining the multi-modal information of text and image content.
The invention improves the flexibility of image search: most conventional image search methods use several fixed features for searching, whereas in the invention the feature combination can be configured flexibly and freely.
The invention improves the accuracy of image search: by combining the results returned by the text-based and the content-based image retrieval methods, a more accurate picture relevance ranking list is obtained, and the accuracy of the returned results is greatly improved.
The above-mentioned embodiments merely express several embodiments of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.