Disclosure of Invention
In order to solve the defects in the prior art, the application provides a chip surface defect segmentation model based on an improved YOLOv-seg, together with a training method and an application thereof. The application uses the improved YOLOv-seg segmentation detection model to accurately segment and detect chip surface defects, and can maintain detection precision even when facing complex backgrounds and low-contrast defects.
The technical scheme adopted by the invention is as follows:
A training method for a chip surface defect segmentation model based on an improved YOLOv-seg comprises the following steps:
Step 1, the network structure of an improved YOLOv-seg segmentation detection model is built. The improved YOLOv-seg segmentation detection model comprises a Backbone module, a Neck module and a head module. The Backbone module comprises multi-layer ShuffleNetV2 network structures; an input picture is subjected to multiple convolution operations through the ShuffleNetV2 network structures to extract 3 effective feature layers. The Neck module adopts a Concat_SDI network structure and an EMA attention mechanism network structure: the Concat_SDI network structure performs multi-level feature fusion on the effective feature layers output by the Backbone module to obtain a feature map, and the EMA attention mechanism network structure performs feature weighting on the feature map. The head module is a segmentation head that outputs the final feature information based on the output of the Neck module;
Step 2, after the network structure of the improved YOLOv-seg segmentation detection model is built, the built improved YOLOv-seg segmentation detection model is trained, validated and tested using pictures of chip surface defects.
Further, the Backbone module is formed by sequentially connecting a Conv_maxpool network structure, a first ShuffleNetV2 network structure, a second ShuffleNetV2 network structure, a third ShuffleNetV2 network structure, a fourth ShuffleNetV2 network structure, a fifth ShuffleNetV2 network structure, a sixth ShuffleNetV2 network structure, an SPPF network structure and an EMA, wherein the fourth ShuffleNetV2 network structure, the sixth ShuffleNetV2 network structure and the SPPF network structure output a first effective feature layer, a second effective feature layer and a third effective feature layer respectively.
Further, different ShuffleNetV2 network structures are selected according to the input stride. When the input stride is 1, the input is split into two paths after passing through a Channel Split layer: one path is connected directly to the input of the Concat layer, while the other passes through a Conv layer, a DWConv layer and a Conv layer in sequence before connecting to the input of the Concat layer. The Concat layer splices the input tensors and feeds the result into a Channel Shuffle layer, and the Channel Shuffle layer outputs the rearranged feature map channels;
When the input stride is 2, the input is processed by two paths: one path comprises a DWConv layer and a Conv layer connected in sequence, and the other comprises a Conv layer, a DWConv layer and a Conv layer connected in sequence. The outputs of the two paths are connected to the input of the Concat layer; the Concat layer splices the input tensors and feeds the result into a Channel Shuffle layer, and the Channel Shuffle layer outputs the rearranged feature map channels.
Further, the Neck module is composed of 2 Conv network structures, 4 C2f network structures, 2 upsampling layers, 4 Concat_SDI network structures and 3 EMA attention mechanism network structures, wherein the first upsampling layer is sequentially connected with a first Concat_SDI network structure, a first C2f network structure, a second upsampling layer, a second Concat_SDI network structure, a second C2f network structure, a first EMA attention mechanism network structure, a first Conv network structure, a third Concat_SDI network structure, a third C2f network structure, a second EMA attention mechanism network structure, a second Conv network structure, a fourth Concat_SDI network structure, a fourth C2f network structure and a third EMA attention mechanism network structure, and the first C2f network structure is also directly connected with the third Concat_SDI network structure.
Further, the multi-level feature fusion in the Neck module proceeds as follows: the Neck module receives the 3 effective feature layers output by the Backbone, wherein the first effective feature layer P1 output by the fourth ShuffleNetV2 network structure is fed into the second Concat_SDI network structure and fused at the same scale with the feature map N1 to be detected;
the second effective feature layer P2 output by the sixth ShuffleNetV2 network structure is fed into the first Concat_SDI network structure and fused at the same scale with the feature map N2 to be detected;
and the third effective feature layer P3 output by the SPPF module is fed into the fourth Concat_SDI network structure and fused at the same scale with the feature map N3 to be detected.
Further, the EMA attention mechanism network structure comprises a groups-style branch and a Cross-space learning branch. In the groups-style branch, attention weight descriptors of the grouped feature maps are extracted through three parallel routes; the groups-style branch not only encodes inter-channel information to adjust the importance of different channels, but also preserves accurate spatial structure information within the channels.
Further, in the groups-style branch, attention weight descriptors of the grouped feature maps are extracted through three parallel routes: two parallel paths on the 1x1 branches and one path on the 3x3 branch. For any given input feature map X, the feature map is divided into G sub-features along the channel dimension to learn different semantics; one 1x1 branch learns features along the X direction of the feature map while the other 1x1 branch learns features along the Y direction. The outputs of the two 1x1 branches are processed by Concat and Conv and then decomposed into two vectors; two nonlinear Sigmoid functions are used to fit the two-dimensional binomial distribution after the linear convolution, realizing different cross-channel interaction features. Finally, the outputs of the two parallel 1x1 paths and the grouped output are fed into Re-weight together, while the feature map also undergoes feature learning along the 3x3 branch before entering the Cross-space learning branch.
Further, in the Cross-space learning branch, the output of the 1x1 branch and the output of the 3x3 branch serve as input tensors. The output of the 1x1 branch passes sequentially through GroupNorm, Avg Pool, Softmax and Matmul to output a vector of class probabilities; the output of the 3x3 branch passes sequentially through Avg Pool, Softmax and Matmul to output the probability of each class; and the GroupNorm output must also be connected to the Matmul on the 3x3 path;
Finally, the outputs of the two branches are fused using a nonlinear Sigmoid function to obtain a probability value between 0 and 1, and the features are then Re-weighted together with the grouped output, finally yielding the smoothed model parameters and features.
A chip surface defect segmentation model based on the improved YOLOv-seg is obtained by training with the above method.
A chip surface defect segmentation detection method based on the improved YOLOv-seg comprises the following steps:
step 1, acquiring a picture of a chip surface defect to be identified by using an industrial camera;
Step 2, the image to be identified obtained in step 1 is input into the improved YOLOv-seg chip surface defect segmentation model, segmentation prediction of chip surface defects is carried out on the image to be identified, and information on the chip surface defects in the industrial production process is output.
The invention has the beneficial effects that:
1. Aiming at the problem that existing methods have low detection precision when facing complex backgrounds and low-contrast defects, the invention improves the network structure of the feature extraction and feature fusion parts of the YOLOv network. A lighter-weight feature extraction module is introduced so that the details of low-contrast defects are better captured, and a finer fusion strategy is adopted during feature fusion so that features from different layers complement each other more effectively. This optimization of the network structure not only improves the detection of low-contrast defects, but also enhances the overall detection precision and reliability.
2. An IoU (Intersection over Union) loss function is introduced to quantify the quality of the mask prediction, and the IoU value is taken into the total loss function as an additional loss term. This directly optimizes the prediction performance of the model in boundary and overlapping areas and effectively improves the overall segmentation performance for chip surface defects, so that in practical application various types of defects can be identified and located more accurately, providing detection results of higher quality.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
The application discloses a training method for a chip surface defect segmentation model based on the improved YOLOv-seg, which, with reference to the accompanying figures 1-6, comprises the following steps:
Step 1, the network structure of the improved YOLOv-seg segmentation detection model is built. As shown in fig. 2, the network structure of the improved YOLOv-seg segmentation detection model specifically comprises a Backbone module, a Neck module and a head module.
1. The Backbone module mainly comprises 1 Conv_maxpool network structure, 6 ShuffleNetV2 network structures, 1 SPPF network structure and 1 EMA attention mechanism structure, formed by sequentially connecting the Conv_maxpool network structure, a first ShuffleNetV2 network structure, a second ShuffleNetV2 network structure, a third ShuffleNetV2 network structure, a fourth ShuffleNetV2 network structure, a fifth ShuffleNetV2 network structure, a sixth ShuffleNetV2 network structure, the SPPF network structure and the EMA, wherein the fourth ShuffleNetV2 network structure, the sixth ShuffleNetV2 network structure and the SPPF network structure output a first effective feature layer, a second effective feature layer and a third effective feature layer respectively. In the Backbone module, 3 effective feature layers are extracted after the input picture is convolved multiple times by the ShuffleNetV2 network structures.
More specifically, the Conv_maxpool network structure extracts features at different levels from the input picture, the SPPF network structure performs multi-scale feature extraction and dimension-reduction processing on the output of the last ShuffleNetV2 network structure, and the EMA smooths the model parameters.
More specifically, the ShuffleNetV2 network architecture introduces a Channel Shuffle mechanism on the basis of depthwise separable convolution. The channel shuffle mechanism breaks the independence between channels through a "channel shuffling" operation, forcing the network to enhance feature expression while maintaining a low computational load. Specifically, the ShuffleNetV2 network structure adopts "grouped convolution" and "channel shuffling" to reduce the amount of computation, and the channels of the feature map are rearranged to improve the feature extraction capability of the model.
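As an illustration, the following is a minimal sketch of the "channel shuffling" operation described above, assuming PyTorch tensors in NCHW layout; the function name and example values are illustrative only, not part of the original implementation.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Rearrange channels so information flows across convolution groups."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    # (N, C, H, W) -> (N, groups, C//groups, H, W) -> swap -> flatten back
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# Example: 8 channels in 2 groups are interleaved across the groups.
t = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(t, 2).flatten().tolist())  # [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```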
The ShuffleNetV2 network structure is shown in fig. 3, and different ShuffleNetV2 network structures are selected according to the input stride. When the input stride is 1, the ShuffleNetV2 network structure is as shown in fig. 3 (a): the input is split into two paths after passing through a Channel Split layer, one connected directly to the input of the Concat layer, the other passing through a Conv layer, a DWConv layer and a Conv layer in sequence before connecting to the input of the Concat layer; the Concat layer splices the input tensors and feeds the result into a Channel Shuffle layer, and the Channel Shuffle layer outputs the rearranged feature map channels.
When the input stride is 2, the ShuffleNetV2 network structure is as shown in fig. 3 (b): the input is processed by two paths, one comprising a DWConv layer and a Conv layer connected in sequence, the other comprising a Conv layer, a DWConv layer and a Conv layer connected in sequence; the outputs of the two paths are connected to the input of the Concat layer, the Concat layer splices the input tensors and feeds the result into a Channel Shuffle layer, and the Channel Shuffle layer outputs the rearranged feature map channels.
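The two units can be sketched in PyTorch as follows, reusing the channel_shuffle helper above. This is a hedged reconstruction of the stride-1 and stride-2 blocks in fig. 3; the layer widths and the conv_bn_relu/dwconv_bn helpers are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    # 1x1 Conv layer followed by BatchNorm and the ReLU discussed below
    return nn.Sequential(nn.Conv2d(cin, cout, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def dwconv_bn(cin, stride):
    # 3x3 depthwise convolution (the DWConv layer in fig. 3)
    return nn.Sequential(nn.Conv2d(cin, cin, 3, stride, 1, groups=cin, bias=False),
                         nn.BatchNorm2d(cin))

class ShuffleNetV2Unit(nn.Module):
    def __init__(self, cin: int, cout: int, stride: int):
        super().__init__()
        self.stride = stride
        half = cout // 2
        if stride == 1:  # fig. 3 (a): Channel Split + one transformed path
            assert cin == cout
            self.branch2 = nn.Sequential(conv_bn_relu(half, half),
                                         dwconv_bn(half, 1),
                                         conv_bn_relu(half, half))
        else:            # fig. 3 (b): two downsampling paths
            self.branch1 = nn.Sequential(dwconv_bn(cin, 2), conv_bn_relu(cin, half))
            self.branch2 = nn.Sequential(conv_bn_relu(cin, half),
                                         dwconv_bn(half, 2),
                                         conv_bn_relu(half, half))

    def forward(self, x):
        if self.stride == 1:
            x1, x2 = x.chunk(2, dim=1)                       # Channel Split layer
            out = torch.cat((x1, self.branch2(x2)), dim=1)   # Concat layer
        else:
            out = torch.cat((self.branch1(x), self.branch2(x)), dim=1)
        return channel_shuffle(out, 2)                       # Channel Shuffle layer
```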
Furthermore, the ShuffleNetV2 network structure adopts the ReLU activation function, which introduces nonlinearity so that the model can fit complex functions. On one hand, because the derivative on the positive half-axis is constantly 1, the training process is accelerated and gradient vanishing is reduced; on the other hand, because negative values are output as zero, model sparsity is improved and the number of active model parameters is reduced. The expression of the ReLU activation function is as follows:
f(x)=max(0,x)
Wherein x is the feature value input from the previous layer.
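As a quick illustrative check of the expression above (the framework choice is ours, not the application's):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(torch.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000]) -- negative values map to 0
```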
2. The Neck module adopts Concat_SDI to realize multi-level feature fusion; for each level of the feature map, Concat_SDI fuses high-level features containing more semantic information with low-level features capturing finer details. The Neck module is composed of 2 Conv network structures, 4 C2f network structures, 2 upsampling layers, 4 Concat_SDI network structures and 3 EMA attention mechanism network structures. The first upsampling layer is sequentially connected with the first Concat_SDI network structure, the first C2f network structure, the second upsampling layer, the second Concat_SDI network structure, the second C2f network structure, the first EMA attention mechanism network structure, the first Conv network structure, the third Concat_SDI network structure, the third C2f network structure, the second EMA attention mechanism network structure, the second Conv network structure, the fourth Concat_SDI network structure, the fourth C2f network structure and the third EMA attention mechanism network structure, and the first C2f network structure is also directly connected with the third Concat_SDI network structure. The first upsampling layer receives the output of the EMA attention mechanism network structure in the Backbone module, the fourth Concat_SDI network structure receives the third effective feature layer output by the SPPF in the Backbone module, the first Concat_SDI network structure receives the second effective feature layer output by the sixth ShuffleNetV2 network structure in the Backbone module, and the second Concat_SDI network structure receives the first effective feature layer output by the fourth ShuffleNetV2 network structure in the Backbone module.
More specifically, the Concat_SDI network structure is shown in fig. 5 and is composed of four feature extraction layers, a forward propagation calculation layer and an output layer.
As shown in fig. 5, the 3 effective feature layers output by the Backbone are input into the Concat_SDI network structures: the first effective feature layer P1 output by the fourth ShuffleNetV2 network structure is fed into the second Concat_SDI network structure and fused at the same scale with the feature map N1 to be detected;
the second effective feature layer P2 output by the sixth ShuffleNetV2 network structure is fed into the first Concat_SDI network structure and fused at the same scale with the feature map N2 to be detected;
and the third effective feature layer P3 output by the SPPF module is fed into the fourth Concat_SDI network structure and fused at the same scale with the feature map N3 to be detected.
As for the feature maps N1, N2 and N3 to be detected: N1 is the feature map obtained through the second upsampling; N2 is the feature map generated by upsampling the feature map N1 of the previous network layer and fusing it with the second effective feature layer P2 output by the sixth ShuffleNetV2 network structure and the feature extraction network part F2, i.e. it results from feature fusion of three different network levels; and N3 is the feature map obtained through the second Conv network structure.
More specifically, the EMA attention mechanism network structure is shown in fig. 4. EMA enhances feature learning capability through parallel 1x1 and 3x3 convolutions: the 1x1 branch keeps the channel dimension unchanged and captures fine-grained channel information, while the 3x3 branch aggregates multi-scale spatial information. Through feature grouping and cross-space learning, EMA can effectively model local and global feature dependencies, improving pixel-level attention and performance; the output of EMA is consistent with the input size. The EMA cross-space efficient multi-scale attention mechanism realizes feature weighting by learning the importance of different areas of the image, and performs weighted fusion of the feature maps with the corresponding attention weights to obtain the final multi-scale feature representation.
With reference to fig. 4, the EMA attention mechanism network structure comprises a groups-style branch and a Cross-space learning branch. In the groups-style branch, attention weight descriptors of the grouped feature maps are extracted through three parallel routes: two parallel paths on the 1x1 branches and one path on the 3x3 branch. For any given input feature map X, the feature map is divided into G sub-features along the channel dimension to learn different semantics; one 1x1 branch learns features along the X direction of the feature map and the other 1x1 branch along the Y direction. Specifically, two 1D global average pooling operations are adopted in the 1x1 branches to encode the channels along the two spatial directions.
The outputs of the two 1x1 branches are processed by Concat and Conv and then decomposed into two vectors; that is, the two encoded features are concatenated along the image height direction, so that they share the same 1x1 convolution without dimension reduction in the 1x1 branch. Finally, the outputs of the two parallel 1x1 paths and the grouped output are fed into Re-weight together to re-weight the features.
The feature map also undergoes feature learning along the 3x3 branch: only one 3x3 kernel is stacked in the 3x3 branch to capture a multi-scale feature representation and expand the feature space, after which the result enters the Cross-space learning branch.
The groups-style branch not only encodes inter-channel information to adjust the importance of different channels, but also preserves accurate spatial structure information within the channels.
In the Cross-space learning branch, the output of the 1x1 branch and the output of the 3x3 branch are taken as input tensors.
The output of the 1x1 branch passes sequentially through GroupNorm, Avg Pool, Softmax and Matmul to output a vector of class probabilities; the output of the 3x3 branch passes sequentially through Avg Pool, Softmax and Matmul to output the probability of each class; and the GroupNorm output must also be connected to the Matmul on the 3x3 path;
Finally, the outputs of the two branches are fused using a nonlinear Sigmoid function to obtain a probability value between 0 and 1, and the features are then Re-weighted together with the grouped output, finally yielding the smoothed model parameters and features.
Then, two-dimensional global average pooling is used to encode the global spatial information of the output of the 1x1 branch, and the output of the other branch is directly converted into the corresponding dimension shape before the channel-feature joint activation mechanism. Multiplying the outputs of the parallel processing by a matrix dot-product operation yields the first spatial attention map. In addition, two-dimensional global average pooling encodes the global spatial information on the 3x3 branch, and the 1x1 branch is directly converted into the corresponding dimension shape before the channel-feature joint activation mechanism; on this basis, a second spatial attention map that preserves the complete and accurate spatial position information is derived. Finally, the output feature map within each group is computed as the aggregation of the two generated spatial attention weight values, followed by a Sigmoid function; this captures pairwise relationships at the pixel level and highlights the global context of all pixels. The final output of EMA is the same size as X.
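The branch structure described above can be sketched as follows. This follows the publicly described EMA (efficient multi-scale attention) design; the grouping factor and variable names are assumptions for illustration, not the exact configuration used in the application.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.g = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # 1D pooling along one spatial direction
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # 1D pooling along the other direction
        self.conv1x1 = nn.Conv2d(cg, cg, 1)            # shared 1x1 conv after Concat
        self.conv3x3 = nn.Conv2d(cg, cg, 3, padding=1) # single 3x3 kernel of the 3x3 branch
        self.gn = nn.GroupNorm(cg, cg)
        self.agp = nn.AdaptiveAvgPool2d(1)             # Avg Pool of the Cross-space branch
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        gx = x.reshape(b * self.g, -1, h, w)                 # split X into G sub-features
        # groups-style branch: encode along the two spatial directions, shared conv
        xh = self.pool_h(gx)                                 # (b*g, c/g, h, 1)
        xw = self.pool_w(gx).permute(0, 1, 3, 2)             # (b*g, c/g, w, 1)
        hw = self.conv1x1(torch.cat([xh, xw], dim=2))        # Concat along height + Conv
        xh, xw = torch.split(hw, [h, w], dim=2)              # decompose into two vectors
        x1 = self.gn(gx * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(gx)
        # Cross-space learning: Avg Pool + Softmax on one branch, Matmul with the other
        x11 = self.softmax(self.agp(x1).reshape(b * self.g, -1, 1).permute(0, 2, 1))
        x12 = x2.reshape(b * self.g, c // self.g, -1)
        x21 = self.softmax(self.agp(x2).reshape(b * self.g, -1, 1).permute(0, 2, 1))
        x22 = x1.reshape(b * self.g, c // self.g, -1)
        attn = (x11 @ x12 + x21 @ x22).reshape(b * self.g, 1, h, w)
        return (gx * attn.sigmoid()).reshape(b, c, h, w)     # same size as the input X
```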
More specifically, the Concat_SDI network structure of the invention adds SDI on the basis of ordinary splicing to realize multi-level feature fusion, with the following formula:
f1i = φci(φsi(f0i))
wherein f0i represents the original feature map of the i-th stage, φsi and φci represent the parameters of the spatial and channel attention mechanisms respectively, and f1i is the processed feature map.
The SDI module enhances the representation capability of each level of the feature map by combining the semantic information of high-level features with the detailed information of low-level features, thereby improving precision in the image segmentation task. Compared with the traditional way of splicing feature maps, the skip connection of this module is simpler and more efficient, and reduces the computational complexity and GPU memory usage;
Further, the intermediate steps of SDI are explained in detail according to the SDI formula:
Firstly, the number of channels of f1i is reduced to a hyperparameter-defined value through a 1x1 convolution to obtain a new feature map f2i;
Further, the feature maps are transferred to the decoder, and the SDI module adjusts the feature map of each stage to the same resolution. The adjusted feature map is denoted f3ij, where i represents the target level and j represents the source level of the feature map. The adjustment operation is as follows:
Further, for the case of j < i, adaptive average pooling is used to adjust the size;
for the case of j=i, an identity mapping is used;
For the case of j > i, resizing using bilinear interpolation;
Further, the adjusted feature maps are smoothed through a 3x3 convolution and denoted f4ij; finally, all the adjusted feature maps are fused together through Hadamard products, enhancing the semantic information and the detail information of each level of the feature map and obtaining f5i.
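The adjustment-and-fusion procedure of the SDI module (adaptive average pooling for j < i, identity for j = i, bilinear interpolation for j > i, a 3x3 smoothing convolution, then Hadamard-product fusion) can be sketched as follows; the class layout and channel handling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDI(nn.Module):
    def __init__(self, channels: int, num_stages: int):
        super().__init__()
        # one 3x3 smoothing convolution per source stage (producing f4ij)
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_stages))

    def forward(self, feats: list, i: int) -> torch.Tensor:
        """feats: per-stage feature maps f2 (already reduced to `channels`);
        i: index of the target level whose resolution is kept."""
        h, w = feats[i].shape[-2:]
        out = None
        for j, (f, conv) in enumerate(zip(feats, self.smooth)):
            if j < i:    # source level below the target: adaptive average pooling
                f = F.adaptive_avg_pool2d(f, (h, w))
            elif j > i:  # source level above the target: bilinear interpolation
                f = F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
            # j == i: identity mapping (f is used as-is)
            f = conv(f)                          # 3x3 smoothing -> f4ij
            out = f if out is None else out * f  # Hadamard-product fusion -> f5i
        return out
```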
3. The head module is the YOLOv segmentation head and specifically includes 3 Segments, which correspond respectively to the first EMA attention mechanism network structure, the second EMA attention mechanism network structure and the third EMA attention mechanism network structure in the Neck module, and output the final feature information including category labels, bounding box coordinates, bounding box areas, segmentation masks and confidence levels.
Segment predicts the category of each pixel using the feature map extracted by the feature extraction network. An IoU loss function is adopted to measure the IoU value between the predicted mask and the real mask, and IoU is added to the total loss as an additional loss term, which optimizes the model's prediction in boundary and overlapping areas and improves the overall segmentation performance. The formula is as follows:
IoU = Intersection(Pred, Target) / Union(Pred, Target)
wherein Intersection(Pred, Target) is the intersection area of the predicted segmented region and the true labeled region, and Union(Pred, Target) is the union area of the predicted segmented region and the true labeled region.
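A minimal sketch of this mask IoU term follows; treating the extra loss as 1 - IoU is our assumption about how the IoU value enters the total loss, and `pred`/`target` are illustrative names for a soft predicted mask in [0, 1] and a binary ground-truth mask.

```python
import torch

def mask_iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Intersection(Pred, Target) and Union(Pred, Target) over the spatial dims
    inter = (pred * target).sum(dim=(-2, -1))
    union = (pred + target - pred * target).sum(dim=(-2, -1))
    iou = (inter + eps) / (union + eps)
    return (1.0 - iou).mean()  # added to the total loss as the extra IoU term
```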
Step 2, after the network structure of the improved YOLOv-seg segmentation detection model is built, the built improved YOLOv-seg segmentation detection model is trained, validated and tested. The specific steps are as follows:
step 2.1, data set preparation.
Firstly, 900 pictures with chip surface defects are acquired using an industrial camera, and the collected pictures of chip surface defects are annotated with Labelme software, the annotation name being "Crack"; in this embodiment, the annotation format adopts the PascalVOC format.
Then, image preprocessing is carried out on the images in the data set of step 2.1. The preprocessing specifically includes data enhancement of the chip surface defect data set by means of random cropping, rotation, flipping, color jittering, contrast adjustment and the like, which improves the robustness of the model under different data conditions. The images are resized to the input size required by the model, a uniform 640 x 640, and the pixel values are normalized to the range 0 to 1 or standardized (subtracting the mean and dividing by the standard deviation) to make model training more stable. Meanwhile, the Mosaic data enhancement method is used to splice four pictures together, indirectly increasing the batch_size so that a single GPU can achieve a better training effect. The enhanced data set is divided into a training set, a validation set and a test set, with the ratios (training set + validation set) : test set = 9 : 1 and training set : validation set = 9 : 1.
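The sketch below illustrates the resizing, normalization and 9:1 splits described above; the OpenCV-based helpers and parameter choices are assumptions for illustration, not the application's exact pipeline.

```python
import random
import numpy as np
import cv2

def preprocess(img: np.ndarray, mean=None, std=None) -> np.ndarray:
    img = cv2.resize(img, (640, 640))        # uniform model input size
    img = img.astype(np.float32) / 255.0     # normalize pixel values to 0..1
    if mean is not None and std is not None:
        img = (img - mean) / std             # optional standardization
    return img

def split_dataset(paths: list) -> tuple:
    random.shuffle(paths)
    n_test = len(paths) // 10                # (train + val) : test = 9 : 1
    test, trainval = paths[:n_test], paths[n_test:]
    n_val = len(trainval) // 10              # train : val = 9 : 1
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test
```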
Step 2.2, model training.
The improved YOLOv-seg segmentation detection model described above is trained using the training set and validation set of the data set. The number of training epochs is set to 500; for the first 150 epochs the backbone feature extraction network is frozen and the initial learning rate is set to 0.01, and for the last 300 epochs the backbone feature extraction network is unfrozen and the initial learning rate is set to 0.0001. Segmentation detection is then verified on the test set of the constructed chip surface defect data set to complete accurate segmentation of the chip surface defects.
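A generic PyTorch sketch of this two-phase freeze/unfreeze schedule is shown below; `model.backbone` and `train_one_epoch` are placeholder names for the actual model attribute and training loop, and the SGD optimizer choice is an assumption.

```python
import torch

def train_two_phase(model, loaders, train_one_epoch):
    # Phase 1: freeze the backbone feature extraction network, lr = 0.01
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.01)
    for _ in range(150):
        train_one_epoch(model, loaders, opt)
    # Phase 2: unfreeze the backbone, lr = 0.0001
    for p in model.backbone.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(model.parameters(), lr=0.0001)
    for _ in range(300):
        train_one_epoch(model, loaders, opt)
```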
Example 2
A chip surface defect segmentation model based on the improved YOLOv-seg is obtained by training with the method of Example 1.
Example 3
A chip surface defect segmentation detection method based on the improved YOLOv-seg comprises the following steps:
step 1, acquiring a picture of a chip surface defect to be identified by using an industrial camera;
Step 2, the image to be identified obtained in step 1 is input into the chip surface defect segmentation model based on the improved YOLOv-seg of Example 2, segmentation prediction of chip surface defects is carried out on the image to be identified, and information on the chip surface defects in the industrial production process is output, including category label names, bounding box coordinates, bounding box areas, segmentation masks and category confidences.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above embodiments are merely for illustrating the design concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present invention are within the scope of the present invention.