Disclosure of Invention
The invention provides a multi-class vegetable seedling identification method based on a lightweight two-stage detection model, which aims to improve the detection accuracy and speed for vegetable seedlings in a natural environment. The lightweight two-stage detection model adopts mixed depthwise separable convolution as the front-end base network to process the input image, improving the speed and efficiency of image feature extraction; a feature pyramid network is introduced to fuse feature information from different levels of the front-end base network, enhancing the detection model's recognition accuracy on multi-scale targets; and by compressing the channel dimensions and the number of fully connected layers of the detection head, the parameter scale and computational complexity of the model are reduced.
In a first aspect, the invention provides a method for identifying multi-class vegetable seedlings based on a lightweight two-stage detection model, which comprises the following steps:
S01, acquiring a multi-class vegetable seedling image data set, and performing data enhancement on the image data set;
S02, labeling the enhanced data set, and dividing the labeled data set into a training set, a validation set, and a test set;
S03, building a lightweight two-stage detection model on the TensorFlow deep learning framework: designing a mixed depthwise separable convolutional neural network as the front-end base network, fusing feature information from different levels of the front-end base network with a feature pyramid network, and compressing the channel dimensions and the number of fully connected layers of the detection head;
S04, initializing the parameters of the lightweight two-stage detection model, and inputting the training set into the detection model for training based on stochastic gradient descent;
S05, after training is finished, inputting the image to be recognized into the detection model, and outputting the class and position information of the vegetable seedlings.
Optionally, acquiring the image data set of multi-class vegetable seedlings in step S01 and performing data enhancement on it specifically includes:
(1.1) positioning the camera at an angle of 80-90 degrees to the horizontal direction of the crop rows and about 80 cm above the ground, and acquiring images of various vegetable seedlings under different weather conditions, illumination directions, and environmental backgrounds to construct an image data set;
(1.2) performing data enhancement on the image data set through geometric transformation and color transformation.
Optionally, labeling the enhanced data set in step S02 and dividing it into a training set, a validation set, and a test set specifically includes:
(2.1) labeling the class and position of the vegetable seedlings in the enhanced data set with labeling software;
(2.2) randomly splitting the labeled data set into a training set, a validation set, and a test set at a ratio of 7:2:1.
Optionally, in step S03, building a lightweight two-stage detection model on the TensorFlow deep learning framework, designing a mixed depthwise separable convolutional neural network as the front-end base network, fusing feature information from different levels of the front-end base network with a feature pyramid network, and compressing the channel dimensions and the number of fully connected layers of the detection head specifically includes:
(3.1) fusing a plurality of convolution kernels of different sizes into a single depthwise separable convolution operation to form a mixed depthwise separable convolutional neural network, and using it as the front-end base network to extract features from the input image;
(3.2) introducing a feature pyramid network to fuse features from different levels of the front-end base network, and inputting the fused feature maps into a region proposal network to generate a series of sparse prediction boxes;
(3.3) in the detection head network, operating on the output features of the final stage of the mixed depthwise separable convolutional neural network with asymmetric convolution to generate a feature map with fewer channels, feeding the feature map and the prediction boxes into 1 fully connected layer to obtain global features of the detection targets, and completing target classification and position prediction with 2 parallel branches.
Optionally, initializing the parameters of the lightweight two-stage detection model in step S04 and inputting the training set into the detection model for training based on stochastic gradient descent specifically includes:
(4.1) initializing the front-end feature extraction network with the weight parameters of a pre-trained mixed depthwise separable convolutional neural network, and randomly initializing the remaining layers with a Gaussian distribution of mean 0 and standard deviation 0.01;
(4.2) setting the hyper-parameters related to model training, and training with a multi-task loss function as the objective function based on stochastic gradient descent;
(4.3) during training, computing the loss of each input sample with an online hard example mining strategy, sorting the losses in descending order, and back-propagating only the top 1% of samples with the largest losses (the hard samples) to update the model weight parameters.
In a second aspect, the present invention further provides a multi-class vegetable seedling recognition system based on a lightweight two-stage detection model, including:
an image acquisition and enhancement module, configured to acquire an image data set of multi-class vegetable seedlings and perform data enhancement on it;
an image labeling and classifying module, configured to label the enhanced data set and divide the labeled data set into a training set, a validation set, and a test set;
a detection model building module, configured to build a lightweight two-stage detection model on the TensorFlow deep learning framework: designing a mixed depthwise separable convolutional neural network as the front-end base network, fusing feature information from different levels of the front-end base network with a feature pyramid network, and compressing the channel dimensions and the number of fully connected layers of the detection head;
a detection model training module, configured to initialize the parameters of the lightweight two-stage detection model and input the training set into the detection model for training based on stochastic gradient descent;
and a detection result output module, configured to input the image to be recognized into the trained detection model and output the class and position information of the vegetable seedlings.
Optionally, the image acquisition and enhancement module specifically includes:
an image acquisition unit, configured to position the camera at an angle of 80-90 degrees to the horizontal direction of the crop rows and about 80 cm above the ground, and to acquire images of various vegetable seedlings under different weather conditions, illumination directions, and environmental backgrounds to construct an image data set; and an image enhancement unit, configured to perform data enhancement on the image data set through geometric transformation and color transformation.
Optionally, the image labeling and classifying module specifically includes:
a labeling unit, configured to label the class and position of the vegetable seedlings in the enhanced data set with labeling software;
and a classification unit, configured to randomly split the labeled data set into a training set, a validation set, and a test set at a ratio of 7:2:1.
Optionally, the detection model building module specifically includes:
a front-end base network unit, configured to fuse a plurality of convolution kernels of different sizes into a single depthwise separable convolution operation to form a mixed depthwise separable convolutional neural network, and to use it as the front-end base network to extract features from the input image;
a feature information fusion unit, configured to introduce a feature pyramid network to fuse features from different levels of the front-end base network, and to input the fused feature maps into a region proposal network to generate a series of sparse prediction boxes;
and a lightweight detection head unit, configured to operate on the output features of the final stage of the mixed depthwise separable convolutional neural network with asymmetric convolution in the detection head network to generate a feature map with fewer channels, to feed the feature map and the prediction boxes into 1 fully connected layer to obtain global features of the detection targets, and to complete target classification and position prediction with 2 parallel branches.
Optionally, the detection model training module specifically includes:
an initialization unit, configured to initialize the front-end feature extraction network with the weight parameters of a pre-trained mixed depthwise separable convolutional neural network, and to randomly initialize the remaining layers with a Gaussian distribution of mean 0 and standard deviation 0.01;
a training unit, configured to set the hyper-parameters related to model training and to train with a multi-task loss function as the objective function based on stochastic gradient descent;
and a hard example mining unit, configured to compute the loss of each input sample with an online hard example mining strategy during training, sort the losses in descending order, and back-propagate only the top 1% of samples with the largest losses (the hard samples) to update the model weight parameters.
It can be seen from the above technical solutions that the method and system for identifying multi-class vegetable seedlings based on a lightweight two-stage detection model provided by the invention have the following advantages:
First, a mixed depthwise separable convolutional neural network is used as the front-end base network to extract features from the input image, so that the computed feature map pixels have different receptive fields, effectively improving the speed and efficiency of image feature extraction.
Second, a feature pyramid network is adopted to fuse features from different levels of the front-end base network; the fused feature maps have both sufficient resolution and strong semantic information, which enhances the detection accuracy for multi-scale targets.
Third, the detection head network is designed to be lightweight: compressing the channel dimensions and the number of fully connected layers removes redundant parameters, reduces the computational cost of the model, and improves its inference speed.
Fourth, the multi-class vegetable seedling identification method and system based on the lightweight two-stage detection model achieve high recognition accuracy and fast inference, and can be deployed on embedded agricultural mobile devices with limited computing power and storage resources.
Detailed Description
The following embodiments are described in detail with reference to the accompanying drawings; they are only used to clearly illustrate the technical solutions of the present invention and should not be taken to limit its protection scope.
Fig. 1 is a schematic flow chart of a method for identifying multi-class vegetable seedlings based on a lightweight two-stage detection model according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
101. acquiring a multi-class vegetable seedling image data set, and performing data enhancement on the image data set;
102. labeling the enhanced data set, and dividing the labeled data set into a training set, a validation set, and a test set;
103. building a lightweight two-stage detection model on the TensorFlow deep learning framework: designing a mixed depthwise separable convolutional neural network as the front-end base network, fusing feature information from different levels of the front-end base network with a feature pyramid network, and compressing the channel dimensions and the number of fully connected layers of the detection head;
104. initializing the parameters of the lightweight two-stage detection model, and inputting the training set into the detection model for training based on stochastic gradient descent;
105. after training is finished, inputting the image to be recognized into the detection model, and outputting the class and position information of the vegetable seedlings.
Step 101 comprises the following specific steps:
(1.1) positioning the camera at an angle of 80-90 degrees to the horizontal direction of the crop rows and about 80 cm above the ground, and acquiring images of various vegetable seedlings under different weather conditions, illumination directions, and environmental backgrounds to construct an image data set;
(1.2) performing data enhancement on the image data set through geometric transformation and color transformation;
for example, in the present embodiment, a Matlab tool is used for data enhancement. Geometric transformation: randomly dividing an original image data set into 2 parts, carrying out image rotation on one part, and selecting a rotation angle of-20 degrees, -5 degrees, 5 degrees and 20 degrees to generate a new image; and the other part randomly performs mirror image turning, horizontal turning and vertical turning on the image. Color transformation: the original image is transformed from RGB color space to HVS color space, and the brightness (Value) and Saturation (Saturation) of the image are randomly adjusted, wherein the brightness adjustment Value is 0.8 times, 0.9 times, 1.1 times and 1.2 times of the original Value, and the Saturation adjustment Value is 0.85 times, 0.95 times, 1.05 times and 1.15 times of the original Value. And combining the original image data set, the geometric transformation data set and the color transformation data set to form an enhanced image data set.
Step 102 comprises the following specific steps:
(2.1) labeling the class and position of the vegetable seedlings in the enhanced data set with labeling software;
for example, LabelImg software is used for image annotation in this embodiment. Firstly, double-clicking LabelImg software to enter an operation interface, and opening a folder (Open Dir) where an image to be marked is located; then, setting a marked image storage directory (Change Save Dir), marking a target area in the current image by using Create \ nRectBox and setting a class name; finally, the labeled image (Save) is saved, and the Next image is clicked for marking (Next). The marked image is generated under the condition of saving a file path, the name of the xml file is consistent with the name of the marked image, and the file comprises information such as the name, the path, the target quantity, the category, the size and the like of the marked image;
(2.2) randomly splitting the labeled data set into a training set, a validation set, and a test set at a ratio of 7:2:1.
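The 7:2:1 random split can be sketched as follows (assumptions: the samples are image/annotation pairs held in a list, and the function name and fixed seed are illustrative):

```python
import random

def split_dataset(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Randomly split labeled samples into train/val/test at the 7:2:1 ratio.

    Illustrative sketch; the embodiment's actual file handling is not shown.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)          # reproducible shuffle
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]              # remainder goes to test
    return train, val, test
```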
Step 103 comprises the following specific steps:
(3.1) fusing a plurality of convolution kernels of different sizes into a single depthwise separable convolution operation to form a mixed depthwise separable convolutional neural network, and using it as the front-end base network to extract features from the input image; the specific process is as follows:
In this embodiment, the deep learning framework is TensorFlow, and the program is written in Python on a Windows 10 operating system. The design idea of the mixed depthwise separable convolutional neural network is as follows: let the input feature map be X_(h,w,c), where h denotes the height of the feature map, w its width, and c its number of channels. The feature map is divided along the channel direction into g groups of sub-feature maps, where c_s (s = 1, 2, ..., g) denotes the number of channels of the s-th group and c_1 + c_2 + ... + c_g = c. Establish g groups of depthwise convolution kernels W^t of different sizes, where m denotes the channel multiplier (so that the t-th group has z_t = m × c_t output channels) and k_t × k_t (t = 1, 2, ..., g) denotes the size of the t-th group of kernels. The t-th group of input sub-feature maps is convolved with the corresponding depthwise kernels to obtain the t-th group of output sub-feature maps, specifically defined as:

    Y^t_(x,y,z_t) = Σ_i Σ_j W^t_(i,j,z_t) · X^t_(x+i, y+j, z_t)

where x denotes the pixel row index of the feature map, y the pixel column index, z_t the channel index of the t-th group of output feature maps, h the height of the input feature map, w its width, i the row index of the convolution kernel elements, j the column index, Y^t the t-th group of output sub-feature maps, X^t the t-th group of input sub-feature maps, and W^t the t-th group of depthwise convolution kernels.

According to the above formula, all output sub-feature maps are spliced along the channel dimension to obtain the final output feature map:

    Y_(x,y,z) = Concat(Y^1, Y^2, ..., Y^g)

where z denotes the number of channels of the output feature map, z = z_1 + ... + z_g, and Y_(x,y,z) denotes the spliced output feature map.
The structure of the mixed depthwise separable convolutional neural network in this embodiment is shown in Fig. 2. The maximum number of feature map groups g is 5, each group has the same number of channels, and the corresponding depthwise kernel sizes are {3×3, 5×5, 7×7, 9×9, 11×11}; the grouped feature maps are convolved with kernels of different sizes and the results are spliced to obtain the output. In Fig. 2, the network is divided into 5 stages according to feature map size: feature maps of the same size belong to the same Stage, and the feature map scale ratio between adjacent stages is 2.
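A minimal numerical sketch of the grouped operation described above, with NumPy and random kernels standing in for learned weights (assumptions: equal group sizes, "same" zero padding, channel multiplier m = 1; this is not the embodiment's TensorFlow code):

```python
import numpy as np

def mix_depthwise_conv(x, kernel_sizes=(3, 5)):
    """Mixed depthwise separable convolution (MixConv-style) sketch.

    x: input feature map of shape (h, w, c); the channels are split evenly
    into len(kernel_sizes) groups and each group gets its own depthwise
    kernel size, after which the group outputs are spliced back together.
    """
    h, w, c = x.shape
    g = len(kernel_sizes)
    groups = np.split(x, g, axis=2)              # g groups along channels
    outputs = []
    rng = np.random.default_rng(0)
    for sub, k in zip(groups, kernel_sizes):
        pad = k // 2                             # "same" zero padding
        padded = np.pad(sub, ((pad, pad), (pad, pad), (0, 0)))
        kern = rng.standard_normal((k, k, sub.shape[2]))
        out = np.zeros_like(sub)
        for i in range(k):                       # depthwise: one kernel
            for j in range(k):                   # slice per input channel
                out += kern[i, j] * padded[i:i + h, j:j + w]
        outputs.append(out)
    return np.concatenate(outputs, axis=2)       # splice along channels
```

Each channel group is convolved with its own kernel size, so the pixels of the output feature map carry different receptive fields, as stated above.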
(3.2) introducing a feature pyramid network to fuse features from different levels of the front-end base network, and inputting the fused feature maps into a region proposal network to generate a series of sparse prediction boxes; the specific process is as follows:
In this embodiment, a feature pyramid network is merged into the mixed depthwise separable convolutional neural network, as shown in Fig. 3. In Fig. 3, the mixed depthwise separable convolution generates the feature maps of the different stages in bottom-up order, where in Stage x/y (x = 1, 2, 3, 4, 5; y = 2, 4, 8, 16, 32) x denotes the stage number and y the reduction factor of the feature map size relative to the input image at that stage. Stages 2-5 are each passed through a 1×1 convolution before being input to the feature pyramid network; the 1×1 convolution keeps the number of channels input from each stage's feature map consistent. The feature pyramid network unit upsamples each high-level feature map in top-down order to enlarge its resolution, and then fuses it with the adjacent lower-level features by addition. The fused feature map is, on the one hand, input to the subsequent network for prediction and, on the other hand, further fused with lower-level feature maps through upsampling. Stages 2-5 of the mixed depthwise separable convolution correspond to the P2-P5 levels of the feature pyramid network; P6 is obtained by downsampling Stage 5 and is used only for generating prediction boxes in the region proposal network, without participating in the fusion. Each level of {P2, P3, P4, P5, P6} handles information of a single scale, corresponding to prediction boxes of scales {16², 32², 64², 128², 256²}; each scale has 3 aspect ratios {1:1, 1:2, 2:1}, giving 15 prediction boxes in total for predicting target objects and background;
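The top-down fusion described above can be sketched shape-wise as follows (assumptions: 4 backbone stage outputs with spatial size halving per stage, random 1×1 projection weights, nearest-neighbour upsampling, and an illustrative common channel width d = 8):

```python
import numpy as np

def fpn_fuse(c2, c3, c4, c5, d=8, seed=0):
    """Top-down feature pyramid fusion sketch (shapes only, random weights).

    c2..c5: backbone stage outputs of shape (h, w, c_i), each half the
    spatial size of the previous. Each is projected to d channels by a
    1x1 convolution, then higher levels are 2x upsampled and added.
    """
    rng = np.random.default_rng(seed)

    def conv1x1(x):
        w = rng.standard_normal((x.shape[2], d))
        return x @ w                              # 1x1 conv = channel mix

    def upsample2x(x):
        return x.repeat(2, axis=0).repeat(2, axis=1)   # nearest neighbour

    p5 = conv1x1(c5)
    p4 = conv1x1(c4) + upsample2x(p5)             # fuse by addition
    p3 = conv1x1(c3) + upsample2x(p4)
    p2 = conv1x1(c2) + upsample2x(p3)
    p6 = p5[::2, ::2]                             # downsample Stage 5 for P6
    return p2, p3, p4, p5, p6
```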
(3.3) in the detection head network, operating on the output features of the final stage of the mixed depthwise separable convolutional neural network with asymmetric convolution to generate a feature map with fewer channels, feeding the feature map and the prediction boxes into 1 fully connected layer to obtain global features of the detection targets, and completing target classification and position prediction with 2 parallel branches; the specific process is as follows:
in this embodiment, the lightweight detection head unit is constructed by compressing the network channel dimension and the parameter scale, and the specific design method is as follows: generating a feature map of an alpha multiplied by p channel by adopting a large-size asymmetric convolution aiming at a feature map output by a final stage of a mixed deep separation convolutional neural network, wherein alpha is a number which is irrelevant to a category and has a small numerical value, the value of alpha is 10, the value of p multiplied by p is equal to the number of grids after pooling of a candidate area, the value of p multiplied by p is 49, and a feature map of a 490 channel is obtained through calculation; then, introducing ROI Align operation to pool the feature information corresponding to the prediction frames with different sizes to generate a feature map with a fixed size, wherein the ROI Align operation acquires the numerical value of a pixel point with coordinates as floating point numbers by using a bilinear difference method, and the whole feature aggregation process is converted into a continuous operation; finally, accessing 1 full-connection layer to obtain global characteristics of the detected target, and completing target classification and position prediction based on 2 parallel branches; as used herein, large-scale asymmetric convolution consists of 1 × 15 and 15 × 12 convolution kernels; FIG. 4 is a schematic block diagram of a lightweight two-stage target detection model.
Step 104 comprises the following specific steps:
(4.1) initializing the front-end feature extraction network with the weight parameters of a pre-trained mixed depthwise separable convolutional neural network, and randomly initializing the remaining layers with a Gaussian distribution of mean 0 and standard deviation 0.01;
(4.2) setting the hyper-parameters related to model training, and training with a multi-task loss function as the objective function based on stochastic gradient descent; the specific process is as follows:
the momentum factor is 0.9, and the weight attenuation coefficient is 5X 10-4The initial learning rate is 0.002, the attenuation rate is 0.9, the attenuation is 1 time after every 2000 iterations, the accuracy rate of the training model is tested on the verification set, and the total iteration number of the model training is 50000;
Second, in the training process a multi-task loss function is adopted to complete the confidence discrimination of the target class and the position regression, specifically defined as:

    L_Total = L_RPN(p_l, a_l) + L_HEAD(p, u, o, g)

where

    L_RPN(p_l, a_l) = (1/N_cls) Σ_l L_cls(p_l, p_l*) + λ (1/N_reg) Σ_l p_l* L_DIOU(a_l, a_l*)

    L_HEAD(p, u, o, g) = L_cls(p, u) + λ' [u ≥ 1] L_DIOU(o, g)

    L_DIOU(A, B) = 1 - IoU(A, B) + ρ²(A_ctr, B_ctr) / c²

The loss function of this embodiment comprises two parts, the region proposal network loss and the detection head loss, and each part comprises a classification loss and a position regression loss. In the formulas, L_Total is the detection model loss, L_RPN the region proposal network loss, L_HEAD the detection head network loss, l the anchor box index, p_l the two-class prediction probability of the l-th anchor box, p_l* the ground-truth discrimination value of the l-th anchor box, a_l the prediction box corresponding to the l-th anchor box, a_l* the real box corresponding to the l-th anchor box, p the predicted class probability, u the real class label, λ and λ' weight parameters, L_cls the classification loss, N_cls the number of sampled anchor boxes, N_reg the number of sampled positive and negative samples, o the prediction box output by the region proposal network, g the real box corresponding to the prediction box, L_DIOU the Distance-IoU (DIoU) loss, A the prediction box, B the real box, c the diagonal length of the minimum bounding box enclosing A and B, ρ(·) the Euclidean distance, A_ctr the center point coordinates of the prediction box, B_ctr the center point coordinates of the real box, and IoU (Intersection over Union) the intersection ratio of the prediction box and the real box;
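The DIoU loss defined above can be sketched in plain Python (assumption: boxes are given as (x1, y1, x2, y2) corner coordinates):

```python
def diou_loss(a, b):
    """Distance-IoU loss between boxes a and b, each (x1, y1, x2, y2).

    L_DIoU = 1 - IoU + rho^2(A_ctr, B_ctr) / c^2, where rho is the
    Euclidean distance between box centers and c the diagonal length of
    the smallest box enclosing both.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # intersection and union areas
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # squared center distance over squared enclosing-box diagonal
    actr = ((ax1 + ax2) / 2, (ay1 + ay2) / 2)
    bctr = ((bx1 + bx2) / 2, (by1 + by2) / 2)
    rho2 = (actr[0] - bctr[0]) ** 2 + (actr[1] - bctr[1]) ** 2
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2
    return 1.0 - iou + rho2 / c2
```

Unlike a plain IoU loss, the distance term still gives a gradient signal when the boxes do not overlap at all.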
(4.3) during training, computing the loss of each input sample with an online hard example mining strategy, sorting the losses in descending order, and back-propagating only the top 1% of samples with the largest losses (the hard samples) to update the model weight parameters.
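The hard example selection step can be sketched as follows (illustrative helper name; losses is a list of per-sample loss values, and only the returned indices would be back-propagated):

```python
def select_hard_examples(losses, fraction=0.01):
    """Online hard example mining: keep the top `fraction` of samples by loss.

    Returns the indices of the hard samples, sorted from largest loss down.
    """
    n_keep = max(1, int(len(losses) * fraction))   # keep at least one
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:n_keep]                          # indices of hard samples
```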
Step 105 comprises the following specific steps:
(5.1) in the trained detection model, setting the class confidence threshold to 0.5 and the intersection-over-union threshold to 0.5;
(5.2) inputting the image to be recognized into the trained detection model to obtain the multi-class vegetable seedling recognition results, each comprising a target class label, a class confidence, and a target position box.
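The thresholds in step (5.1) can be sketched as post-processing in plain Python, assuming the IoU threshold is applied as standard non-maximum suppression (an assumption; the embodiment does not spell out the suppression procedure):

```python
def nms(boxes, scores, conf_thr=0.5, iou_thr=0.5):
    """Filter detections by class confidence, then non-maximum suppression.

    boxes: list of (x1, y1, x2, y2); scores: matching confidences.
    Returns indices of the kept detections, highest score first.
    """
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    # drop low-confidence boxes, then greedily keep non-overlapping ones
    cand = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                  key=lambda i: scores[i], reverse=True)
    keep = []
    for i in cand:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```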
Fig. 5 is a schematic structural diagram of a multi-class vegetable seedling identification system based on a lightweight two-stage detection model according to an embodiment of the present invention. As shown in Fig. 5, the system includes:
an image acquisition and enhancement module 501, configured to acquire an image data set of multi-class vegetable seedlings and perform data enhancement on it;
an image labeling and classifying module 502, configured to label the enhanced data set and divide the labeled data set into a training set, a validation set, and a test set;
a detection model building module 503, configured to build a lightweight two-stage detection model on the TensorFlow deep learning framework: designing a mixed depthwise separable convolutional neural network as the front-end base network, fusing feature information from different levels of the front-end base network with a feature pyramid network, and compressing the channel dimensions and the number of fully connected layers of the detection head;
a detection model training module 504, configured to initialize the parameters of the lightweight two-stage detection model and input the training set into the detection model for training based on stochastic gradient descent;
and a detection result output module 505, configured to input the image to be recognized into the trained detection model and output the class and position information of the vegetable seedlings.
The image acquisition and enhancement module 501 specifically includes:
an image acquisition unit, configured to position the camera at an angle of 80-90 degrees to the horizontal direction of the crop rows and about 80 cm above the ground, and to acquire images of various vegetable seedlings under different weather conditions, illumination directions, and environmental backgrounds to construct an image data set; and an image enhancement unit, configured to perform data enhancement on the image data set through geometric transformation and color transformation.
The image labeling and classifyingmodule 502 specifically includes:
a labeling unit, configured to label the class and position of the vegetable seedlings in the enhanced data set with labeling software;
and a classification unit, configured to randomly split the labeled data set into a training set, a validation set, and a test set at a ratio of 7:2:1.
The detectionmodel building module 503 specifically includes:
a front-end base network unit, configured to fuse a plurality of convolution kernels of different sizes into a single depthwise separable convolution operation to form a mixed depthwise separable convolutional neural network, and to use it as the front-end base network to extract features from the input image;
a feature information fusion unit, configured to introduce a feature pyramid network to fuse features from different levels of the front-end base network, and to input the fused feature maps into a region proposal network to generate a series of sparse prediction boxes;
and a lightweight detection head unit, configured to operate on the output features of the final stage of the mixed depthwise separable convolutional neural network with asymmetric convolution in the detection head network to generate a feature map with fewer channels, to feed the feature map and the prediction boxes into 1 fully connected layer to obtain global features of the detection targets, and to complete target classification and position prediction with 2 parallel branches.
The detectionmodel training module 504 specifically includes:
an initialization unit, configured to initialize the front-end feature extraction network with the weight parameters of a pre-trained mixed depthwise separable convolutional neural network, and to randomly initialize the remaining layers with a Gaussian distribution of mean 0 and standard deviation 0.01;
a training unit, configured to set the hyper-parameters related to model training and to train with a multi-task loss function as the objective function based on stochastic gradient descent;
and a hard example mining unit, configured to compute the loss of each input sample with an online hard example mining strategy during training, sort the losses in descending order, and back-propagate only the top 1% of samples with the largest losses (the hard samples) to update the model weight parameters.
The detectionresult output module 505 specifically includes:
a threshold setting unit, configured to set a class confidence threshold of 0.5 and an intersection-over-union threshold of 0.5 in the trained detection model;
and a detection output unit, configured to input the image to be recognized into the trained detection model and obtain the multi-class vegetable seedling recognition results, each comprising a target class label, a class confidence, and a target position box.
The system corresponds one-to-one with the method of the invention, so the calculation processes of the parameters described for the method also apply to the system and are not repeated here.
In the description of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Finally, it should be noted that the above examples are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that the technical solutions described in those embodiments may still be modified, or some or all of their technical features equivalently replaced; such modifications and substitutions do not depart from the spirit of the invention and are intended to be included within the scope of the claims and the specification.