CN111310666B - A high-resolution image feature recognition and segmentation method based on texture features - Google Patents

A high-resolution image feature recognition and segmentation method based on texture features

Info

Publication number
CN111310666B
Authority
CN
China
Prior art keywords
feature
texture
matrix
features
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010099370.0A
Other languages
Chinese (zh)
Other versions
CN111310666A (en)
Inventor
吴炜
高明
范菁
夏列钢
杨海平
陈婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202010099370.0A
Publication of CN111310666A
Application granted
Publication of CN111310666B
Legal status: Active
Anticipated expiration

Abstract


A high-resolution image ground-feature recognition and segmentation method based on texture features, comprising: step 1, creating a sample set according to a category system; step 2, constructing a deep learning network model, including: step 2.1, constructing a backbone network; step 2.2, constructing a texture feature extraction structure; step 2.3, constructing a feature matrix denoising structure; step 2.4, constructing an upsampling structure; step 3, training the deep learning network model; step 4, image prediction; step 5, post-processing of the segmentation results. The invention uses a deep learning network framework; by resetting the downsampling factor of the network and explicitly adding a texture information extraction structure, it both reduces the information loss of small targets and improves the expressive power of texture information. A feature matrix denoising module is added to the deep network model to reduce the extra noise introduced by the computation, realize pixel-by-pixel texture expression, and further improve the accuracy of the network model.


Description

High-resolution image ground feature identification and segmentation method based on texture features
Technical Field
The invention discloses a technology relating to ultrahigh resolution image information extraction, in particular to a method for identifying and segmenting ground objects from ultrahigh resolution images acquired by an unmanned aerial vehicle or a satellite according to the texture characteristics of the ground objects.
Background
The rapid development of light, small unmanned aerial vehicle (UAV) technology has made UAV remote sensing widely adopted. Compared with satellite remote sensing, a UAV flies at low altitude and is not affected by noise factors such as cloud and cloud shadow; compared with traditional aerial remote sensing, its data acquisition cost is greatly reduced. Because the UAV offers a flexible, mobile data acquisition mode and can acquire ultrahigh-resolution images, it is widely used in small-area applications such as agricultural loss investigation, area statistics and sample collection, and has become an important supplement to satellite and traditional aerial remote sensing.
Compared with medium- and low-resolution images, UAV images have fewer spectral bands, are often acquired with non-metric cameras, and lack fine radiometric correction. However, a UAV image has ultrahigh spatial resolution and can accurately represent the spatial distribution of pixels within a ground object, namely the arrangement, combination and contrast of pixels of different colors, which together form distinctive texture features. Research on ground-object recognition and segmentation based on texture features is therefore of great significance for the use of UAV imagery; to this end, researchers at home and abroad have developed a variety of methods, which mainly fall into three categories: statistical methods, model-based methods and deep learning methods.
(1) Statistics-based texture feature representation: these methods represent texture by defining statistical indices within a local area. The gray level co-occurrence matrix (GLCM) is a representative method: it first constructs a co-occurrence matrix from the co-occurrence relations between the gray values of a pixel and its neighboring area, and on this basis defines a series of derived indices such as entropy and moments to describe the texture characteristics of the region. However, the various GLCM measures are sensitive only to certain special textures and may lack distinctiveness for other texture types; moreover, how the neighborhood and its gray-level co-occurrence relation are defined also affects the performance of the algorithm.
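For concreteness, a gray level co-occurrence matrix and a few of its derived indices can be computed with scikit-image as in the sketch below; the window size, distances, angles and the chosen indices are illustrative assumptions, not values prescribed by the invention.

```python
# Sketch: texture description of a local window via a gray level co-occurrence
# matrix (GLCM) and derived statistics, using scikit-image.
# Window size, gray levels, distances and angles are illustrative assumptions.
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # spelled 'greycomatrix' in scikit-image < 0.19

def glcm_descriptor(window_u8: np.ndarray) -> np.ndarray:
    """window_u8: 2-D uint8 image patch (e.g. 32 x 32)."""
    glcm = graycomatrix(window_u8,
                        distances=[1, 2],            # neighbour offsets
                        angles=[0, np.pi / 2],       # horizontal and vertical pairs
                        levels=256,
                        symmetric=True, normed=True)
    # A few derived indices; each value is averaged over distances/angles.
    feats = [graycoprops(glcm, prop).mean()
             for prop in ("contrast", "homogeneity", "energy", "correlation")]
    return np.asarray(feats)

patch = (np.random.rand(32, 32) * 255).astype(np.uint8)
print(glcm_descriptor(patch))  # 4-dimensional texture descriptor of the window
```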
(2) Model-based texture feature representation: these methods first model the pixel distribution in a certain way, turning texture feature extraction into a model parameter estimation problem; typical examples are the random field model and the visual bag-of-words model.
The random field model describes the statistical dependence of a pixel on its neighbors with a probabilistic model; for example, the Markov random field model describes texture under the assumption that any pixel is related only to its neighboring pixels (Kenduiywo B K, Bargiel D, Soergel U. Higher Order Dynamic Conditional Random Fields Ensemble for Crop Type Classification in Radar Images [J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(8): 4638–4654). The problems with this approach are that different models perform differently on different texture characteristics, the model parameters are complex, and various optimization algorithms are required, all of which bring uncertainty to the method.
Texture feature representation with the visual bag-of-words model consists of three main steps: feature extraction, feature clustering and feature coding. Feature extraction extracts a series of feature points from the image and computes a high-dimensional feature for each point, commonly using local descriptors such as SIFT or SURF; feature clustering performs unsupervised clustering on the feature points and takes the cluster centers as feature codes, thus building a visual dictionary; feature coding computes the response of the current feature on each visual word to obtain a feature vector representation of the local region, which describes the texture by the responses of the feature points to the visual words.
Although statistics-based and model-based methods have achieved good results in applications, they share a problem: the feature extraction schemes are all preset and cannot be adjusted according to the texture content of the image; they are window-based descriptors, mainly applied to image classification, and cannot achieve pixel-level segmentation accuracy for ground objects.
(3) Deep-learning-based texture representation: deep learning methods based on convolutional neural networks have been hugely successful in computer vision and related fields. A deep convolutional network first extracts features from the image, capturing color, structure, local correlation and other information through stacked convolutional layers followed by cascaded pooling. The features obtained by convolution, however, preserve the spatial arrangement relations among the pixels within the convolution window and are effective for characteristics such as spatial arrangement. Texture, by contrast, reflects the arrangement and combination of local gray levels or colors within a certain area and their repetition in space, rather than the arrangement of local features, so conventional convolutional feature extraction is not well suited to texture expression.
To address the insufficiency of convolution for texture feature expression, features extracted by a CNN can be quantized with methods such as VLAD to obtain a statistical description of the local feature distribution and thereby represent texture. However, feature extraction and feature quantization in such methods are two separately executed steps and cannot be optimized jointly. In contrast, Deep TEN introduces an encoding layer that learns visual words from samples, obtains the distribution of each feature over the visual words, and replaces concatenation with accumulation to achieve an efficient description of texture features, obtaining good results in texture-image classification (Zhang H, Xue J, Dana K. Deep TEN: Texture Encoding Network [C]// Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017: 2896–). However, that method mainly classifies texture-rich targets; it cannot obtain a pixel-by-pixel texture expression or segment target objects according to the different texture characteristics of each ground object. To this end, EncNet proposes a new network structure that, on top of conventional CNN features, adds an encoding layer to make the ordered features orderless, thereby capturing the local context, achieving pixel-level segmentation, and obtaining the best results on several segmentation tasks (Zhang H, Dana K, Shi J, et al. Context Encoding for Semantic Segmentation [C]// Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018: 7151–).
Deep-learning-based algorithms iteratively learn the texture information in image data end-to-end, adjust the model parameters, and adapt well; they have achieved major breakthroughs in image classification, image segmentation, target tracking and other fields. Using a deep-learning-based texture representation for ground-object recognition and segmentation of ultrahigh-resolution images therefore has important application potential, but current deep learning networks still have the following problems:
(1) To grasp global image information, the image is downsampled several times; after several downsampling operations, an object with a small area may occupy only a single pixel in the feature map. The resulting information loss matters little for a classification task, but for the segmentation of small targets it is often not negligible;
(2) Because many pixels with different characteristics exist within the same ground object in an ultrahigh-resolution image, after downsampling in a deep network the texture information of adjacent targets interferes with the current point during local convolution, bringing extra noise into the result.
To solve these problems, the invention provides a deep learning framework that recognizes and segments ground objects in ultrahigh-resolution images, such as UAV images, using the texture features of the ground objects.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a UAV-image ground-object recognition and segmentation method based on texture features. The method builds on a deep convolutional network model; it effectively improves the capture of target texture information by improving the way pixels in the image perceive global information, increases the sensitivity of small-target detection by reducing the downsampling factor in the network, and suppresses the noise of post-downsampling convolution by adding a feature matrix denoising structure, yielding a smooth segmentation result.
After the images of the study area are acquired, the data are divided into image blocks of size H × W according to the capacity of the experimental equipment and the required efficiency. To reduce the loss of edge information caused by cutting the image, adjacent blocks overlap by a certain number of pixels; the number of overlapping pixels O is determined by the accuracy and efficiency requirements of the experiment, and the image is then partitioned accordingly.
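A minimal sketch of this overlapped tiling is given below; the function name and the demo image size are illustrative, with the 480-pixel tiles and 50-pixel overlap taken from the embodiment described later.

```python
# Sketch: cut a large image into H x W tiles with a fixed pixel overlap O,
# keeping the offset of each tile so predictions can be merged back later.
import numpy as np

def split_with_overlap(image: np.ndarray, tile_h: int, tile_w: int, overlap: int):
    """Cut an (H, W, C) image into tiles whose neighbours share `overlap` pixels."""
    step_y, step_x = tile_h - overlap, tile_w - overlap
    tiles, offsets = [], []
    ys = list(range(0, max(image.shape[0] - tile_h, 0) + 1, step_y))
    xs = list(range(0, max(image.shape[1] - tile_w, 0) + 1, step_x))
    # make sure the bottom/right border is still covered by a full tile
    if ys[-1] + tile_h < image.shape[0]:
        ys.append(image.shape[0] - tile_h)
    if xs[-1] + tile_w < image.shape[1]:
        xs.append(image.shape[1] - tile_w)
    for y in ys:
        for x in xs:
            tiles.append(image[y:y + tile_h, x:x + tile_w])
            offsets.append((y, x))
    return tiles, offsets

# Example: 480 x 480 tiles with a 50-pixel overlap on a 5472 x 3682 UAV image.
img = np.zeros((3682, 5472, 3), dtype=np.uint8)
tiles, offsets = split_with_overlap(img, 480, 480, 50)
```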
The invention discloses a high-resolution image ground-feature recognition and segmentation method based on texture features, comprising the following steps:
Step 1, creating a sample set according to a category system;
According to the research objective, determine the ground-object classification system of the study area, assuming the study area has n different classes CL:
CL = {cl_1, cl_2, ..., cl_n}   (1)
Create the sample set according to this category system; the samples include positive samples of all categories.
Each sample represents the area of that class of ground object as a polygon and uses cl_i ∈ CL to identify its category.
The number of samples must meet the training requirement; if it is insufficient, sample augmentation is performed to increase the number of samples.
Step 2, constructing a deep learning network model;
the network model is divided into four parts: the first part is a backbone network and is used for extracting basic features of the image; the second part is a texture feature extraction structure; the third part is a characteristic matrix denoising structure; and the fourth part is an up-sampling structure, and the de-noised characteristic matrix is up-sampled to the size of the original image to obtain the image category and the segmentation result thereof.
Step 2.1, constructing a backbone network;
This part of the network extracts the basic features of the image; the backbone network is constructed based on ResNet.
ResNet consists of five convolution modules. The first module uses a convolution with stride 2, so the output feature map is 1/2 the size of the original image; the second module uses a pooling layer with stride 2, giving an output feature map 1/4 the original size; the third to fifth modules each use a convolution with stride 2, so the final output feature map is 1/32 the size of the original image.
It follows that the ResNet output feature map is 1/32 of the original image, which severely loses the texture information of small ground objects. The method therefore removes the last ResNet convolution module and uses dilated convolution in the third and fourth convolution modules; this reduces the downsampling factor and preserves the texture characteristics of small targets as much as possible while maintaining the receptive field of the convolution kernels.
The improved ResNet backbone network is used to obtain the basic feature matrix F of the image:
F = {f_1, f_2, ..., f_C}   (2)
where C denotes the dimension of the features and the feature map size is denoted H′ × W′.
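A hedged sketch of such a reduced-stride backbone, built from torchvision's ResNet101 by dropping the last convolution module and dilating a later stage so the output stride stays at 8; the exact dilation and stride settings of the patented network may differ from this approximation.

```python
# Sketch of a reduced-stride backbone in the spirit of step 2.1: ResNet with the
# last convolution module removed and dilation in a later stage, so the feature
# map is only 8x smaller than the input (1024 channels, 60 x 60 for a 480 x 480 tile).
import torch
import torch.nn as nn
import torchvision

class DilatedResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet101(
            replace_stride_with_dilation=[False, True, False])  # dilate layer3 so it keeps 1/8 resolution
        # keep everything up to (and including) layer3; layer4 is dropped
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2, self.layer3 = resnet.layer1, resnet.layer2, resnet.layer3

    def forward(self, x):
        x = self.stem(x)          # 1/4 of the input size
        x = self.layer1(x)        # 1/4
        x = self.layer2(x)        # 1/8
        return self.layer3(x)     # 1/8, 1024 channels

backbone = DilatedResNetBackbone()
feat = backbone(torch.randn(1, 3, 480, 480))
print(feat.shape)  # torch.Size([1, 1024, 60, 60])
```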
Step 2.2, constructing a texture feature extraction structure;
This part of the network extracts texture features.
In the feature matrix F obtained in step 2.1, every position holds a C-dimensional feature value, obtained by convolving the pixels of the local area around that position with different convolution kernels; it can be regarded as the feature vector of that point. The feature matrix F is therefore mapped into t C-dimensional feature vectors:
X = {x_1, x_2, ..., x_t}   (3)
Texture is represented by generalizing the idea of the bag-of-words model.
Suppose there is a dictionary D containing K texture-feature codewords:
D = {d_1, d_2, ..., d_K}   (4)
where codeword d_k has the same dimension as feature vector x_i. The feature dictionary is used to learn typical texture center features from the x_i.
In traditional methods the dictionary, once built, remains fixed and cannot be learned or adjusted from the data. Unlike those methods, the dictionary D here is embedded in the deep learning model, and the different texture features are learned and adjusted by supervised learning, optimizing the dictionary's ability to express texture features.
The dictionary D is initialized randomly from a uniform distribution over a preset interval.
Next, a relation model e_ik between codeword d_k and feature vector x_i is constructed, so that during back-propagation the dictionary D can iteratively adjust its ability to express texture features through gradient propagation.
Because of possible ambiguity between individual codewords, hard assignment cannot be used to model the relation between features and codewords. Soft assignment solves this problem by setting a weight coefficient for each codeword d_k. Since multi-class segmentation is involved and the feature data X contain feature information of several classes, a smoothing factor is set for each codeword d_k ∈ D following the idea of a Gaussian mixture model:
S = {s_1, s_2, ..., s_K}   (5)
where s_i represents the probability that feature vector x_i belongs to codeword d_k. S is initialized randomly from a uniform distribution over a preset interval and is refined by continuous back-propagation iterations to obtain the optimal parameters.
On this basis, the weight coefficient α_ik between the different features and the different codewords is calculated:
α_ik = exp(−s_k·‖r_ik‖²) / Σ_{j=1}^{K} exp(−s_j·‖r_ij‖²)   (6)
where r_ik is the residual distance between the input feature x_i and the dictionary codeword d_k:
r_ik = x_i − d_k   (7)
Through soft weight assignment, the texture feature capture structure obtains the result e_ik of the relation model:
e_ik = α_ik · r_ik   (8)
Here e_ik is regarded as the description of one feature x_i by one codeword d_k. By aggregating all descriptions e_ik of the input features X by codeword d_k, the description of the whole feature matrix by that codeword is obtained:
e_k = Σ_{i=1}^{t} e_ik   (9)
Because the texture primitives of a texture repeat over the image without a fixed order, this aggregation ignores the spatial arrangement information of the features and improves the ability to capture the distribution information of the texture features.
The description of the input features X by each codeword is computed in turn, giving all unordered descriptions E of the input features X by the texture feature dictionary D:
E = {e_1, e_2, ..., e_K}   (10)
where the dimension of E is K × C.
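A minimal sketch of the residual encoding of equations (5)-(10), in the spirit of the Deep TEN encoding layer on which the text builds; the initialization intervals are assumptions, since the original gives them only as formula images.

```python
# Sketch of the residual-encoding step of section 2.2: learnable codewords D and
# smoothing factors S, soft-assignment weights alpha_ik over residuals r_ik = x_i - d_k,
# and aggregation into K unordered descriptors e_k (equations (5)-(10)).
# The initialization intervals below are assumptions; the patent gives them as images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureEncoding(nn.Module):
    def __init__(self, channels: int, num_codewords: int):
        super().__init__()
        bound = 1.0 / (num_codewords * channels) ** 0.5                    # assumed interval
        self.codewords = nn.Parameter(
            torch.empty(num_codewords, channels).uniform_(-bound, bound))  # dictionary D
        self.smoothing = nn.Parameter(
            torch.empty(num_codewords).uniform_(-1.0, 0.0))                # factors S

    def forward(self, feat):                                    # feat: (B, C, H', W')
        b, c, h, w = feat.shape
        x = feat.view(b, c, h * w).transpose(1, 2)              # (B, t, C), t = H' * W'
        r = x.unsqueeze(2) - self.codewords.view(1, 1, -1, c)   # residuals r_ik: (B, t, K, C)
        # Soft assignment (eq. 6); the negative sign of the exponent is folded into
        # the learnable smoothing factors, which start out negative.
        alpha = F.softmax(self.smoothing * r.pow(2).sum(dim=-1), dim=2)    # (B, t, K)
        e = (alpha.unsqueeze(-1) * r).sum(dim=1)                # e_k = sum_i alpha_ik * r_ik: (B, K, C)
        return e

enc = TextureEncoding(channels=1024, num_codewords=32)
E = enc(torch.randn(1, 1024, 30, 30))     # toy spatial size to keep the demo light
print(E.shape)                            # torch.Size([1, 32, 1024])
```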
The description information E of the texture dictionary is then added to the basic feature matrix. Specifically, following the study of channel information in SE-Net, the adjustment coefficient Z of each feature channel is obtained automatically by learning: global pooling of E over its first dimension yields the response value of each channel feature map f_i to the texture feature dictionary, and this value is taken as the adjustment coefficient Z for recalibrating the feature matrix:
Z = (z_1, z_2, ..., z_C)   (11)
The feature matrix F_1 recalibrated according to the texture information is then computed:
F_1 = F * Z   (12)
where * denotes channel-wise multiplication.
This step yields the feature matrix F_1, recalibrated according to the extracted texture feature information; the feature contains not only the information within the convolution kernel but also texture information, and thus describes the texture.
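A sketch of the recalibration of equations (11)-(12): the codeword descriptions E are pooled over their first (codeword) dimension to obtain per-channel coefficients Z, which then rescale the basic feature matrix F channel-wise. Whether an additional gating activation (as in SE-Net) is applied to Z is not fixed here; the plain pooled response is used in this sketch.

```python
# Sketch of equations (11)-(12): pool the dictionary descriptions E over the
# codeword dimension to get per-channel coefficients Z, then rescale the basic
# feature matrix F channel-wise (F1 = F * Z).
import torch

def recalibrate(feat: torch.Tensor, codeword_desc: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H', W') basic features F; codeword_desc: (B, K, C) descriptions E."""
    z = codeword_desc.mean(dim=1)                 # global pooling over the K codewords -> (B, C)
    return feat * z.view(z.size(0), -1, 1, 1)     # channel-wise multiplication (eq. 12)

F = torch.randn(1, 1024, 60, 60)
E = torch.randn(1, 32, 1024)
F1 = recalibrate(F, E)
print(F1.shape)  # torch.Size([1, 1024, 60, 60])
```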
Step 2.3, constructing a feature matrix denoising structure;
This part of the network is the feature matrix denoising structure.
In the deep convolutional network model, after multiple downsampling operations each feature value corresponds to a large area of the original image, so different texture feature information within that area may interfere and form noise. The obtained feature matrix F_1 is therefore denoised to obtain a more accurate ground-object segmentation result.
For images containing many repeated texture primitives, non-local means denoising works well. Compared with smoothing-filter denoising, the non-local means method reconstructs a pixel by computing the local feature similarity between two points; it preserves the texture details of the image better and avoids the blurring of texture features caused by local smoothing.
Following this idea, a reconstruction-based feature matrix denoising structure is designed within the deep learning model.
Step 2.2 provides the feature matrix F_1 containing texture information and the texture feature dictionary D. Through supervised learning, the dictionary D has effectively learned the texture features of the corresponding classes; it is therefore used to reconstruct the feature matrix, highlighting the texture feature information needed by the algorithm.
First, the similarity between each feature vector of the feature matrix F_1 and each codeword is computed. The similarity between vectors can be obtained by cosine similarity; in deep learning, dot-product similarity is used to approximate cosine similarity while preserving computational efficiency. To keep the similarity measure accurate, the vectors are normalized before the dot product is computed. Following the rules of matrix computation, F_1 is transposed to obtain F_1^T, multiplied with D, and the similarity matrix W is obtained through the softmax function:
W = softmax(F_1^T D)   (13)
The softmax function above maps the similarities into the interval (0, 1) and makes the similarities of the K codewords to one feature x_i sum to 1, i.e. Σ_{k=1}^{K} w_ik = 1. The similarities are then used as weights, and matrix multiplication gives the reconstructed feature matrix F_2:
F_2 = D·W^T   (14)
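A sketch of this reconstruction-based denoising: per-pixel feature vectors of F_1 and the codewords of D are L2-normalized, their dot-product similarities are passed through a softmax over the K codewords (equation (13)), and the weighted codeword mixture gives the reconstructed features (equation (14), written per pixel rather than in the transposed matrix form).

```python
# Sketch of the feature-matrix denoising of step 2.3: dot-product similarity
# between normalized feature vectors of F1 and the dictionary codewords D,
# softmax over the codewords (eq. 13), then reconstruction by a weighted
# codeword mixture (eq. 14).
import torch
import torch.nn.functional as F

def denoise(feat1: torch.Tensor, codewords: torch.Tensor) -> torch.Tensor:
    """feat1: (B, C, H', W'); codewords: (K, C). Returns reconstructed (B, C, H', W')."""
    b, c, h, w = feat1.shape
    x = feat1.view(b, c, h * w).transpose(1, 2)         # (B, t, C): one vector per pixel
    x_n = F.normalize(x, dim=-1)                        # normalize before the dot product
    d_n = F.normalize(codewords, dim=-1)                # (K, C)
    w_sim = torch.softmax(x_n @ d_n.t(), dim=-1)        # (B, t, K), rows sum to 1 over K
    recon = w_sim @ codewords                           # (B, t, C): weighted codeword mixture
    return recon.transpose(1, 2).view(b, c, h, w)

F1 = torch.randn(1, 1024, 60, 60)
D = torch.randn(32, 1024)
F2 = denoise(F1, D)
print(F2.shape)  # torch.Size([1, 1024, 60, 60])
```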
Step 2.4, constructing an upsampling structure;
The fourth part is the upsampling structure.
First, the feature matrices F_1 and F_2 computed in step 2.2 and step 2.3 are combined with weight parameters to obtain the final feature matrix G:
G = w_f1 · F_1 + w_f2 · F_2   (15)
where w_f1 and w_f2 are learnable parameters; by exploiting the back-propagation of the deep learning network model, the network automatically adjusts the combination of the feature information to obtain an accurate estimate of the parameters.
Then G is channel-compressed according to the number of classes n, explicitly setting up the structure by which the C channels predict the n classes; bilinear interpolation then upsamples the result to the original image size.
This step yields a feature matrix of the same size as the original image; it not only contains the texture features from the convolution kernels but has also been denoised, better highlighting the texture feature information and removing the noise interference that convolution after downsampling may introduce.
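A sketch of this upsampling head: learnable scalar weights fuse F_1 and F_2 (equation (15)), a 1×1 convolution is assumed as the realization of the channel compression to n class channels, and bilinear interpolation restores the original image size.

```python
# Sketch of step 2.4: learnable scalar weights combine F1 and F2 (eq. 15), the
# result is compressed to n class channels by a 1x1 convolution (an assumed
# realization of the "channel compression"), and bilinear interpolation
# restores the original image size.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.w_f1 = nn.Parameter(torch.tensor(1.0))   # learnable fusion weights
        self.w_f2 = nn.Parameter(torch.tensor(1.0))
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, f1, f2, out_size):
        g = self.w_f1 * f1 + self.w_f2 * f2                     # eq. (15)
        logits = self.classifier(g)                             # n class channels
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)

head = FusionHead(channels=1024, num_classes=24)
out = head(torch.randn(1, 1024, 60, 60), torch.randn(1, 1024, 60, 60), (480, 480))
print(out.shape)  # torch.Size([1, 24, 480, 480])
```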
Step 3, deep learning network model training;
The sample set and label set created in step 1 are used as the input of the network model constructed in step 2; the hyperparameters of the network model are set, and the model is trained by a gradient descent algorithm until a stable result is obtained.
Step 4, image prediction;
The data set is predicted, giving the probability P_ij that pixel j belongs to class i; the class with the maximum probability is taken as the type T_j of the pixel:
T_j = argmax_i P_ij   (16)
Since the pixels in the overlap regions have several different segmentation results, these are merged by majority voting.
By processing all the images, the type of each pixel is obtained, realizing the recognition and segmentation of ground-object classes; connected ground objects of the same type form the minimum texture primitives.
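A sketch of this prediction step: per-pixel argmax over the class probabilities (equation (16)) followed by a majority vote over the overlapping tile predictions covering the same pixel; array names and the demo sizes are illustrative.

```python
# Sketch of step 4: per-pixel class = argmax of predicted probabilities (eq. 16),
# then a majority vote over the (up to four) overlapping tile predictions that
# cover the same pixel.
import numpy as np

def predict_types(prob: np.ndarray) -> np.ndarray:
    """prob: (n_classes, H, W) per-pixel probabilities -> (H, W) class map."""
    return prob.argmax(axis=0)

def merge_by_vote(tile_maps, offsets, full_shape, n_classes):
    """tile_maps: list of (h, w) class maps; offsets: their top-left corners."""
    votes = np.zeros((n_classes,) + full_shape, dtype=np.int32)
    for cls_map, (y, x) in zip(tile_maps, offsets):
        h, w = cls_map.shape
        # accumulate one vote per pixel for the class predicted by this tile
        ys, xs = np.mgrid[y:y + h, x:x + w]
        votes[cls_map.ravel(), ys.ravel(), xs.ravel()] += 1
    return votes.argmax(axis=0)       # the class with the most votes wins

full = merge_by_vote([np.zeros((480, 480), int), np.ones((480, 480), int)],
                     [(0, 0), (0, 430)], (480, 910), n_classes=24)
```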
Step 5, segmentation result post-processing
Connected regions formed by pixels of the same type are counted. If the size of a connected region in the prediction result is smaller than a threshold e_Con, the region is dilated by a certain amount. If the dilation touches another region, the network's prediction of that connected region's class is considered wrong, and its class is changed to that of the first other connected region it touches; if the dilation touches no other region, the connected region is treated as an isolated noise point and removed.
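A sketch of this post-processing under stated assumptions: small connected regions are dilated with a fixed number of iterations, and the dilation amount and the value used to mark removed noise points are illustrative choices rather than values fixed by the text.

```python
# Sketch of the post-processing in step 5: connected regions smaller than a
# threshold are dilated; if the dilated mask touches another region, the small
# region takes that region's class, otherwise it is treated as isolated noise
# and cleared (marked with -1 here, an assumed convention).
import numpy as np
from scipy import ndimage

def postprocess(class_map: np.ndarray, min_size: int, dilate_iter: int = 3) -> np.ndarray:
    out = class_map.copy()
    for cls in np.unique(class_map):
        labels, n = ndimage.label(class_map == cls)          # connected regions of this class
        for region_id in range(1, n + 1):
            mask = labels == region_id
            if mask.sum() >= min_size:
                continue
            grown = ndimage.binary_dilation(mask, iterations=dilate_iter)
            ring = grown & ~mask                             # pixels newly covered by the dilation
            neighbour_classes = class_map[ring]
            neighbour_classes = neighbour_classes[neighbour_classes != cls]
            if neighbour_classes.size:
                out[mask] = neighbour_classes[0]             # take the class of a touched region
            else:
                out[mask] = -1                               # isolated noise point, removed
    return out

demo = np.zeros((20, 20), dtype=int)
demo[5:7, 5:7] = 1                                  # a 2x2 speckle inside class 0
print(np.unique(postprocess(demo, min_size=25)))    # the speckle is absorbed into class 0
```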
The advantages of the adopted scheme are:
(1) The invention uses a deep learning network framework; by resetting the downsampling factor of the framework and explicitly adding a texture information extraction structure, it reduces the information loss of small targets and improves the expressive power of texture information.
(2) A feature matrix denoising module is added to the deep network model, reducing the extra noise introduced by the computation, realizing pixel-by-pixel texture expression, and further improving the accuracy of the network model.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a distribution diagram of shot points when the unmanned aerial vehicle image is acquired in the embodiment of the invention.
FIG. 3 is a category hierarchy diagram in an embodiment of the present invention.
FIG. 4 is a deep learning network first part: backbone network ResNet schematic.
FIG. 5 is a second part of the deep learning network: the texture feature captures the structural schematic.
FIG. 6 is a third part of the deep learning network: and (5) a characteristic matrix denoising structure schematic diagram.
Fig. 7 shows an image case and a sample label thereof according to the present invention. FIG. 7(a) is an exemplary image of a sample set in accordance with an embodiment of the present invention; FIG. 7(b) is an exemplary diagram of a sample set image tag in an embodiment of the present invention; FIG. 7(c) is a segmentation result of the method herein; FIG. 7(d) is a post-treatment result of the method herein.
Detailed Description
To further clarify the disclosure, the detailed description is given below in conjunction with the flow chart of Fig. 1 and an embodiment of the invention. It should be understood that not all features of an actual implementation are identical to this embodiment, and that implementation details may vary depending on the actual conditions and objectives of an engineering project. In addition, although this example concerns texture-based ground-object segmentation of UAV remote sensing images, the method may also be applied to other image understanding and segmentation tasks.
The study area of this embodiment is Longyou County, Quzhou City, Zhejiang Province (as shown in Fig. 2). The UAV used to acquire the images is a DJI Phantom 4 Pro with a 1-inch CMOS sensor; the spatial resolution of the acquired images is about 6.25 cm, and the image size is 5472 × 3682. The flights acquired 86 images in total, covering a study area of about 1.38 km².
The classification is determined according to the research target. Because the texture features of the same ground object differ considerably in different growth periods, this embodiment treats the same ground object in different periods as different types for testing, and the research classification system is divided into 6 major classes and 24 minor classes. Typical textures of the classes are shown in Fig. 7; the texture features of different classes differ greatly and can be distinguished accurately. Given the limited computing power of the device, the image is cropped into blocks of 480 × 480 pixels. To reduce the seam artifacts produced when the predictions of the divided images are merged, adjacent blocks overlap by 50 pixels both horizontally and vertically.
With the above method, 8256 image blocks with a resolution of 480 × 480 pixels were obtained as the experimental data set.
A high-resolution image ground object identification and segmentation method based on texture features comprises the following steps:
step 1, manufacturing a sample set according to a category system;
LabelMe software was used to make label images of the sample set according to the determined study area classification system, as shown in figure 3.
To ensure that the sample set covers the texture feature information of the classes, positive samples containing all studied classes and negative samples of other classes are first selected manually as the basis of the sample set; the remaining data are then sampled randomly, giving 2500 samples in total.
Step 2, constructing a deep learning network model;
the network model is divided into four parts. The first part is a backbone network and is used for extracting basic features of the image; the second part is a texture feature extraction structure used for extracting the texture features existing in the feature matrix; the third part is a characteristic matrix denoising structure used for denoising the characteristic matrix; and the fourth part is an up-sampling structure, and the feature matrix combined after de-noising is up-sampled to the size of the original image to obtain an image prediction segmentation result.
Step 2.1, constructing a backbone network;
This structure is shown in Fig. 4. The backbone network is constructed based on ResNet; considering the performance and efficiency of the computing equipment, ResNet101 is selected as the backbone for extracting the basic features of the image.
The structure removes the last ResNet convolution module and uses dilated convolution in the third and fourth convolution modules, reducing the downsampling factor while preserving the receptive field of the convolution kernels. The dilation rate of the third convolution module is set to 2 with stride 2, and the dilation rate of the fourth convolution module is set to 4 with stride 1. The resulting basic feature matrix F is downsampled 8× relative to the original image, with C = 1024 channels and shape 1024 × 60 × 60.
Step 2.2, constructing a texture feature capturing structure;
The second part is the texture feature extraction structure, as shown in Fig. 5. The basic feature matrix F extracted in step 2.1 (part 511 of Fig. 5) is mapped into 3600 (= 60 × 60) feature vectors of length 1024.
The texture feature dictionary D is set up, as shown in part 512 of Fig. 5; it contains 32 codewords d_k together with a smoothing factor s_k for each of the 32 codewords.
e_k is calculated using equation (9): through soft weight assignment, the texture feature extraction structure aggregates, along dimension C, the description E of the input features by each codeword of the dictionary, as shown in part 513 of Fig. 5.
The texture information obtained by the texture feature extraction structure is globally pooled over its first dimension to obtain the adjustment coefficient Z, as shown in part 514 of Fig. 5.
The basic feature matrix is adjusted according to the texture features as in equation (12), giving the new feature matrix F_1, as shown in part 515 of Fig. 5.
Step 2.3, constructing a feature matrix denoising structure;
The third part is the feature matrix denoising structure, as shown in Fig. 6.
For F_1 generated in step 2.2 (611 in Fig. 6) and the texture dictionary D generated in step 2.2 (612 in Fig. 6), the similarity between each codeword and each feature vector is calculated using equation (13). The denoised feature matrix F_2 is then obtained using equation (14), as shown in part 613 of Fig. 6.
Step 2.4, constructing an upsampling structure;
The fourth part is the upsampling structure. First, the feature matrices computed in step 2.2 and step 2.3 are combined using the weight parameters, and the final feature matrix G is obtained with equation (15).
By exploiting the back-propagation of the deep learning network model, the network automatically adjusts the combination of the feature information to obtain a more accurate result.
G is then compressed to 24 channels (the total number of classes), explicitly setting up a structure in which each channel predicts one class. Bilinear interpolation then upsamples the result to the original image size of 480 × 480.
Step 3, deep learning network model training;
The sample set and label set created in step 1 are used as the input of the network model constructed in step 2, and the hyperparameters are set: learning rate 0.001, 100 training epochs in total, batch size 16, momentum 0.9, weight decay 0.0001.
The configured model produces a prediction, the loss between the actual labels and the prediction is computed with the loss function, and the model parameters are adjusted according to the learning rate using the back-propagation algorithm. The deep network model is trained iteratively until the loss stabilizes; the network has then converged, and the accuracy of the model on the sample set is computed. The hyperparameters are tuned with standard methods according to the required accuracy to obtain the current optimal model.
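A hedged sketch of such a training loop with the hyperparameters listed above; the model, the data loader, and the choice of cross-entropy loss with SGD are placeholders and common assumptions rather than details fixed by the text.

```python
# Sketch of the training set-up in step 3 with the hyperparameters listed above
# (learning rate 0.001, momentum 0.9, weight decay 0.0001, batch size 16).
# `model` and `train_loader` are placeholders for the network of step 2 and the
# sample set of step 1; the loss/optimizer choice is a common assumption.
import torch
import torch.nn as nn

def train(model, train_loader, epochs: int = 100, device: str = "cuda"):
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=0.0001)
    for epoch in range(epochs):
        running = 0.0
        for images, labels in train_loader:            # images: (16, 3, 480, 480), labels: (16, 480, 480)
            optimizer.zero_grad()
            logits = model(images.to(device))          # (16, 24, 480, 480)
            loss = criterion(logits, labels.to(device).long())
            loss.backward()                            # back-propagate gradients
            optimizer.step()                           # adjust parameters by the learning rate
            running += loss.item()
        print(f"epoch {epoch}: mean loss {running / max(len(train_loader), 1):.4f}")
```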
Step 4, image prediction;
All the prepared data sets are predicted, and the probability P_ij of class i predicted at each pixel j is retained.
Because of the overlapped cropping, each point is predicted at most four times, and the result occurring most often is taken as the final segmentation result.
FIG. 7(a) is an input image of the present embodiment; FIG. 7(b) is a label of an input image in an embodiment of the present invention; fig. 7(c) shows the segmentation result of the present embodiment.
Step 5, post-processing of results;
The minimum texture primitive size computed from the class hierarchy is 5 × 5 pixels.
Connected regions smaller than this size are dilated by a factor of 1.5. If no other connected region is encountered during dilation, the point is judged to be an isolated noise point and erased, as shown in 710 and 711 of Fig. 7(c); if another connected region is touched, the point is changed to the class of the region touched first, as shown in 712 of Fig. 7(c) and 713 of Fig. 7(d).
The embodiments described in this specification merely illustrate implementations of the inventive concept; the scope of the invention should not be regarded as limited to the specific forms set forth in the embodiments, but also covers equivalents that those skilled in the art can conceive based on the inventive concept.

Claims (1)

(Translated from Chinese)
1. A high-resolution image ground-feature recognition and segmentation method based on texture features, comprising the following steps:
Step 1, creating a sample set according to a category system;
According to the research objective, determine the ground-object classification system of the study area, assuming the study area has n different classes CL:
CL = {cl_1, cl_2, ..., cl_n}   (1)
Create the sample set according to the above category system; the samples contain positive samples of all categories;
Each sample represents the area of that class of ground object as a polygon and uses cl_i ∈ CL to identify its category;
The number of samples must meet the training requirement; if it is insufficient, sample augmentation is performed to increase the number of samples;
Step 2, constructing a deep learning network model;
The network model is divided into four parts: the first part is the backbone network, used to extract the basic features of the image; the second part is the texture feature extraction structure; the third part is the feature matrix denoising structure; the fourth part is the upsampling structure, which upsamples the denoised feature matrix to the original image size and obtains the image categories and their segmentation result;
Step 2.1, constructing the backbone network;
Extract the basic features of the image and construct the backbone network based on ResNet;
ResNet consists of five convolution modules: the first module uses a convolution with stride 2, and the output feature map is 1/2 the size of the original image; the second module uses a pooling layer with stride 2, and the output feature map is 1/4 the original size; the third to fifth modules each use a convolution with stride 2, and the final output feature map is 1/32 the size of the original image;
The last convolution module of ResNet is removed, and dilated convolution is used in the third and fourth convolution modules;
The improved ResNet backbone is used to obtain the basic feature matrix F of the image:
F = {f_1, f_2, ..., f_C}   (2)
where C is the dimension of the features and the feature map size is denoted H′ × W′;
Step 2.2, constructing the texture feature extraction structure;
In the feature matrix F obtained in step 2.1, each position holds a C-dimensional feature value, obtained by convolving the pixels of the local area around that position with different convolution kernels; it can be regarded as the feature vector of that point, so the feature matrix F is mapped into t C-dimensional feature vectors:
X = {x_1, x_2, ..., x_t}   (3)
Texture is represented by generalizing the idea of the bag-of-words model;
Suppose there is a texture feature dictionary D containing K codewords:
D = {d_1, d_2, ..., d_K}   (4)
where codeword d_k has the same dimension as feature vector x_i; the feature dictionary is used to learn typical texture center features from the x_i;
The dictionary D is embedded in the deep learning model, and the different texture features are learned and adjusted through supervised learning, optimizing the dictionary's ability to express texture features;
The dictionary D is initialized randomly from a uniform distribution over a preset interval;
Next, a relation model e_ik between codeword d_k and feature vector x_i is constructed, so that during back-propagation the dictionary D can iteratively adjust its ability to express texture features through gradient propagation;
Because of possible ambiguity between codewords, hard assignment cannot be used to model the relation between features and codewords; soft assignment solves this problem by setting a weight coefficient for each codeword d_k; since multi-class segmentation is involved and the feature data X contain feature information of several classes, a smoothing factor is set for each codeword d_k ∈ D following the idea of a Gaussian mixture model:
S = {s_1, s_2, ..., s_K}   (5)
where s_i represents the probability that feature vector x_i belongs to codeword d_k; S is initialized randomly from a uniform distribution over a preset interval and is refined by continuous back-propagation iterations to obtain the optimal parameters;
On this basis, the weight coefficient α_ik between the different features and the different codewords is calculated:
α_ik = exp(−s_k·‖r_ik‖²) / Σ_{j=1}^{K} exp(−s_j·‖r_ij‖²)   (6)
where r_ik is the residual distance between the input feature x_i and the dictionary codeword d_k:
r_ik = x_i − d_k   (7)
Through soft weight assignment, the texture feature capture structure obtains the result e_ik of the relation model:
e_ik = α_ik · r_ik   (8)
Here e_ik is regarded as the description of one feature x_i by one codeword d_k; by aggregating all descriptions e_ik of the input features X by codeword d_k, the description of the whole feature matrix by that codeword is obtained:
e_k = Σ_{i=1}^{t} e_ik   (9)
Because the texture primitives of a texture repeat over the image without a fixed order, aggregation ignores the spatial arrangement information of the features and improves the ability to capture the distribution information of the texture features;
The description of the input features X by each codeword is computed in turn, giving all unordered descriptions E of the input features X by the texture feature dictionary D:
E = {e_1, e_2, ..., e_K}   (10)
where the dimension of E is K × C;
The description information E of the texture dictionary is then added to the basic feature matrix; specifically, following the study of channel information in SE-Net, the adjustment coefficient Z of each feature channel is obtained automatically by learning: global pooling of E over its first dimension yields the response value of each channel feature map f_i to the texture feature dictionary, and this value is taken as the adjustment coefficient Z for recalibrating the feature matrix:
Z = (z_1, z_2, ..., z_C)   (11)
The feature matrix F_1 recalibrated according to the texture information is computed:
F_1 = F * Z   (12)
where * denotes channel-wise multiplication;
This step yields the feature matrix F_1 recalibrated according to the extracted texture feature information; the feature contains not only the information within the convolution kernel but also texture information, and thus describes the texture;
Step 2.3, constructing the feature matrix denoising structure;
In the deep convolutional network model, after multiple downsampling operations a feature value represents a large area of the original image; the obtained feature matrix F_1 is therefore denoised to obtain a more accurate ground-object segmentation result;
A reconstruction-based feature matrix denoising structure is designed within the deep learning model;
Step 2.2 provides the feature matrix F_1 containing texture information and the texture feature dictionary D; through supervised learning, the dictionary D effectively learns the texture features of the corresponding classes; the dictionary D is used to reconstruct the feature matrix and highlight the texture feature information needed by the algorithm;
First, the similarity between each feature vector of the feature matrix F_1 and each codeword is computed; the similarity between vectors can be obtained by cosine similarity, and in deep learning dot-product similarity is used to approximate cosine similarity while preserving computational efficiency; to keep the similarity measure accurate, the vectors are normalized before the dot product is computed; following the rules of matrix computation, F_1 is transposed to obtain F_1^T, multiplied with D, and the similarity matrix W is obtained through the softmax function:
W = softmax(F_1^T D)   (13)
The softmax function above maps the similarities into the interval (0, 1) and makes the similarities of the K codewords to one feature x_i sum to 1; the similarities are then used as weights, and matrix multiplication gives the reconstructed feature matrix F_2:
F_2 = D·W^T   (14)
Step 2.4, constructing the upsampling structure;
First, the feature matrices F_1 and F_2 computed in steps 2.2 and 2.3 are combined with weight parameters to obtain the final feature matrix G:
G = w_f1 · F_1 + w_f2 · F_2   (15)
where w_f1 and w_f2 are learnable parameters; by exploiting the back-propagation of the deep learning network model, the network automatically adjusts the combination of the feature information to obtain an accurate estimate of the parameters;
Then G is channel-compressed according to the number of classes n, explicitly setting up the structure by which the C channels predict the n classes; bilinear interpolation then upsamples the result to the original image size;
This step yields a feature matrix of the same size as the original image; it not only contains the texture features of the convolution kernels but has also been denoised, better highlighting the texture feature information and removing the noise interference that convolution after downsampling may introduce;
Step 3, deep learning network model training;
The sample set and label set created in step 1 are used as the input of the network model constructed in step 2; the hyperparameters of the network model are set, and the model is trained by a gradient descent algorithm until a stable result is obtained;
Step 4, image prediction;
The data set is predicted, giving the probability P_ij that pixel j belongs to class i; the class with the maximum probability is taken as the type T_j of the pixel:
T_j = argmax_i P_ij   (16)
Since the pixels in the overlap regions have several different segmentation results, these are merged by majority voting;
By processing all the images, the type of each pixel is obtained, realizing the recognition and segmentation of ground-object classes; connected ground objects of the same type form the minimum texture primitives;
Step 5, post-processing of the segmentation results;
Connected regions formed by pixels of the same type are counted; regions whose connected-region size in the prediction result is smaller than the threshold e_Con are dilated by a certain amount; if the dilation touches another region, the network's prediction of that connected region's class is considered wrong and its class is changed to that of the first other connected region it touches; if the dilation touches no other region, the connected region is treated as an isolated noise point and removed.
CN202010099370.0A | 2020-02-18 | 2020-02-18 | A high-resolution image feature recognition and segmentation method based on texture features | Active | CN111310666B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010099370.0A (CN111310666B (en)) | 2020-02-18 | 2020-02-18 | A high-resolution image feature recognition and segmentation method based on texture features

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010099370.0A (CN111310666B (en)) | 2020-02-18 | 2020-02-18 | A high-resolution image feature recognition and segmentation method based on texture features

Publications (2)

Publication Number | Publication Date
CN111310666A (en) | 2020-06-19
CN111310666B | 2022-03-18

Family

ID=71148475

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010099370.0A (Active, CN111310666B (en)) | A high-resolution image feature recognition and segmentation method based on texture features | 2020-02-18 | 2020-02-18

Country Status (1)

CountryLink
CN (1)CN111310666B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114173137A (en)* | 2020-09-10 | 2022-03-11 | 北京金山云网络技术有限公司 | Video coding method and device and electronic equipment
CN113011425B (en)* | 2021-03-05 | 2024-06-07 | 上海商汤智能科技有限公司 | Image segmentation method, device, electronic equipment and computer readable storage medium
CN112884673B (en)* | 2021-03-11 | 2025-02-11 | 西安建筑科技大学 | Improved loss function SinGAN method for reconstructing missing information between tomb mural blocks
CN113191386B (en)* | 2021-03-26 | 2023-11-03 | 中国矿业大学 | Chromosome classification model based on grid reconstruction learning
CN112990368B (en)* | 2021-04-26 | 2021-07-30 | 湖南大学 | A polygonal structure-guided single-sample recognition method and system for hyperspectral images
CN113222033A (en)* | 2021-05-19 | 2021-08-06 | 北京数研科技发展有限公司 | Monocular image estimation method based on multi-classification regression model and self-attention mechanism
CN113362323B (en)* | 2021-07-21 | 2022-09-16 | 中国科学院空天信息创新研究院 | Image detection method based on sliding window partitioning
CN114406502B (en)* | 2022-03-14 | 2022-11-25 | 扬州市振东电力器材有限公司 | Laser metal cutting method and system
CN115457346B (en)* | 2022-08-19 | 2025-09-09 | 北京数慧时空信息技术有限公司 | Sample knowledge acquisition and continuous utilization method
CN115661113B (en)* | 2022-11-09 | 2023-05-09 | 浙江酷趣智能科技有限公司 | Moisture-absorbing sweat-releasing fabric and preparation process thereof
CN117611471B (en)* | 2024-01-22 | 2024-04-09 | 中国科学院长春光学精密机械与物理研究所 | High-dynamic image synthesis method based on texture decomposition model
CN117649661B (en)* | 2024-01-30 | 2024-04-12 | 青岛超瑞纳米新材料科技有限公司 | Carbon nanotube preparation state image processing method
CN117809140B (en)* | 2024-03-01 | 2024-05-28 | 榆林拓峰达岸网络科技有限公司 | Image preprocessing system and method based on image recognition
CN118674921B (en)* | 2024-08-22 | 2024-12-27 | 北京航芯微电子科技有限公司 | A vehicle target detection method and device based on SAR image
CN119810677A (en)* | 2025-03-13 | 2025-04-11 | 中国科学院自动化研究所 | End-to-end seedling strip recognition method and device based on deep learning instance segmentation


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101408941A (en)* | 2008-10-20 | 2009-04-15 | 中国科学院遥感应用研究所 | Method for multi-dimension segmentation of remote sensing image and representation of segmentation result hierarchical structure
CN108734661A (en)* | 2018-05-25 | 2018-11-02 | 南京信息工程大学 | High-definition picture prediction technique based on image texture information architecture loss function
CN109118435A (en)* | 2018-06-15 | 2019-01-01 | 广东工业大学 | A kind of depth residual error convolutional neural networks image de-noising method based on PReLU
CN109784283A (en)* | 2019-01-21 | 2019-05-21 | 陕西师范大学 | Remote sensing image target extraction method based on scene recognition task

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bruno T. Kitano et al. Corn Plant Counting Using Deep Learning and UAV Images. IEEE Geoscience and Remote Sensing Letters, 2019-08-08, full text. *

Also Published As

Publication number | Publication date
CN111310666A (en) | 2020-06-19

Similar Documents

Publication | Publication Date | Title
CN111310666B (en) A high-resolution image feature recognition and segmentation method based on texture features
Venugopal. Automatic semantic segmentation with DeepLab dilated learning network for change detection in remote sensing images
CN112668494A (en)Small sample change detection method based on multi-scale feature extraction
CN112488210A (en)Three-dimensional point cloud automatic classification method based on graph convolution neural network
CN113011305B (en) SAR image road extraction method and device based on semantic segmentation and conditional random field
CN107451565B (en) A Semi-Supervised Small-Sample Deep Learning Image Pattern Classification and Recognition Method
Liu et al. Survey of road extraction methods in remote sensing images based on deep learning
CN113392931A (en)Hyperspectral open set classification method based on self-supervision learning and multitask learning
CN113705580A (en)Hyperspectral image classification method based on deep migration learning
CN112381144B (en)Heterogeneous deep network method for non-European and Euclidean domain space spectrum feature learning
CN118155171A (en) A perception network model and detection method for long-distance target detection of commercial vehicles based on improved YOLOv8
CN112446256A (en)Vegetation type identification method based on deep ISA data fusion
CN116486183B (en)SAR image building area classification method based on multiple attention weight fusion characteristics
CN113239895A (en)SAR image change detection method of capsule network based on attention mechanism
CN113139515A (en)Hyperspectral image classification method based on conditional random field and depth feature learning
CN116310851B (en) Change Detection Method of Remote Sensing Image
CN112560719A (en)High-resolution image water body extraction method based on multi-scale convolution-multi-core pooling
CN114782821A (en)Coastal wetland vegetation remote sensing identification method combining multiple migration learning strategies
Zhang et al. Image target recognition model of multi-channel structure convolutional neural network training automatic encoder
Liu et al. High-resolution remote sensing image information extraction and target recognition based on multiple information fusion
CN113239829A (en)Cross-dimension remote sensing data target identification method based on space occupation probability characteristics
CN114578307B (en) A radar target fusion recognition method and system
CN111563528A (en) SAR image classification method based on multi-scale feature learning network and bilateral filtering
CN111242028A (en)Remote sensing image ground object segmentation method based on U-Net
CN111860668B (en)Point cloud identification method for depth convolution network of original 3D point cloud processing

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
