CN116703819B - A damage detection method for railway freight car steel floor based on knowledge distillation - Google Patents

A damage detection method for railway freight car steel floor based on knowledge distillation

Info

Publication number
CN116703819B
CN116703819B (application CN202310399454.XA)
Authority
CN
China
Prior art keywords
model
layer
convolution layers
group
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310399454.XA
Other languages
Chinese (zh)
Other versions
CN116703819A (en)
Inventor
杨绿溪
谢昂
郑志刚
李春国
黄永明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202310399454.XA
Publication of CN116703819A
Application granted
Publication of CN116703819B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a knowledge-distillation-based method for detecting damage to the steel floor of railway freight cars, comprising the following steps: acquiring images of the steel-floor region of a railway freight car and constructing a training set; constructing a teacher network and a student network for steel-floor damage detection and training them; distilling the student network with the teacher network and tuning parameters to obtain the final fault-detection model; and acquiring an image to be detected, processing it, and feeding it to the fault-detection model to obtain the steel-floor damage detection result. Built on deep convolutional neural networks and knowledge distillation, the method adopts an encoder-decoder structure and establishes prediction matching between queries through progressive multi-stage knowledge distillation, so that useful knowledge is gradually transferred to the student model. This automatic, high-accuracy, high-precision detection method addresses the missed and false detections caused by visual fatigue when faults can currently only be judged by the naked eye of train inspectors.

Description

Rail wagon steel floor damage detection method based on knowledge distillation
Technical Field
The invention belongs to the field of target detection in computer vision, and particularly relates to a rail wagon steel floor breakage detection method based on knowledge distillation.
Background
In recent years, deep convolutional neural network models have grown steadily more complex, and their parameter counts and computational costs have expanded accordingly, placing great demands on computing and storage resources. To make deployment feasible on resource-limited embedded devices, neural network model compression techniques are used to reduce model size and computation, so that deep learning models can be deployed efficiently in resource-constrained environments.
Knowledge distillation is a commonly used model compression method, widely applied in image classification. Typically, a complex, high-performing model is first trained as the teacher; the knowledge it has learned is then used to guide the training of a simpler student model, so that the student's performance approaches the teacher's while its parameter count and complexity are greatly reduced, thereby compressing and accelerating the model.
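The teacher-to-student transfer described above is usually implemented by softening both models' output distributions with a temperature and penalizing their divergence. A minimal plain-Python sketch of this standard soft-label loss (the temperature value and logits below are illustrative assumptions, not values from the patent):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: a higher T yields a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """Soft-label loss: KL between temperature-softened teacher and student
    distributions, scaled by T^2 as in standard knowledge distillation."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return (temperature ** 2) * kl_divergence(p_t, p_s)

teacher = [4.0, 1.0, 0.2]   # illustrative class logits
student = [3.0, 1.5, 0.1]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student exactly matches the teacher and grows as the two distributions diverge; in training it is weighted against the student's ordinary hard-label loss.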
Although knowledge distillation achieves good results in conventional convolutional-neural-network-based object detection, such algorithms are difficult to apply directly to Transformer-based detectors, because the convolutional and Transformer architectures differ fundamentally. In CNN-based detection, target information is carried by image feature maps, whereas in Transformer-based detection it is mainly encoded in query vectors; this difference leads to significantly different feature distributions of target information in the two families of methods.
Disclosure of Invention
The invention aims to solve the problem that Transformer-based object detection models are too large to deploy on terminals with limited storage and computing resources. It introduces knowledge distillation into Transformer-based detection and provides a knowledge-distillation-based method for detecting steel-floor damage on railway freight cars, comprising the following steps:
Step 1, acquiring images of a plurality of angles at the bottom of a train;
Step 2, selecting pictures containing steel-floor damage and pictures without damage, annotating the damaged regions, and distinguishing the fault-free samples from the small number of faulty samples;
Step 3, constructing a fault detection model, including a teacher model and a student model;
Step 4, training the teacher model and the student model in parallel using the train-bottom images obtained in step 1; transferring the teacher's knowledge to the student via progressive multi-stage knowledge distillation and teacher feature distillation, passing the dark knowledge of the teacher's decoder layers to the student layer by layer; training the student model with the knowledge-distillation loss function to realize distillation; and finally obtaining the trained fault-detection model;
Step 5, verifying the model's detection results: obtaining an image to be detected, inputting it to the fault-detection model, and computing an anomaly score to obtain the steel-floor damage detection result.
Furthermore, the teacher model and the student model share the same structure, each comprising a backbone network module, an encoder module, a decoder module, and a prediction output module, but they use backbone network modules of different sizes. An image passes through the backbone network module, which extracts high-dimensional vector information and sends it to the encoder module; the encoder module semantically encodes the features and passes them to the decoder module; the decoder computes cross-attention between the feature-map key values and the corresponding region features; finally the prediction module outputs the detection result.
The backbone network module comprises, connected in sequence: an input layer, a first group of convolution layers, a maximum pooling layer, and a second through fifth group of convolution layers. The input layer takes image data of size 513x513. The first group of convolution layers consists of one 7x7 convolution operation followed by one nonlinear activation function. The second group comprises 9 convolution layers, one nonlinear activation layer, and an average pooling layer; the third group comprises 12 convolution layers, one nonlinear activation layer, and an average pooling layer; the fourth group comprises 69 convolution layers, one nonlinear activation layer, and an average pooling layer; the fifth group comprises 9 convolution layers, one nonlinear activation layer, and an average pooling layer. Each convolution layer in the second through fifth groups applies, in sequence, a 1x1 convolution, a 3x3 convolution, and a 1x1 convolution.
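The 1x1 / 3x3 / 1x1 pattern in each convolution group is the standard bottleneck arrangement: the first 1x1 convolution reduces the channel depth, the 3x3 convolution operates on the reduced depth, and the final 1x1 restores it. A small sketch of the parameter arithmetic that motivates this design (the channel widths below are illustrative assumptions, not the patent's):

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution; biases ignored for simplicity."""
    return in_ch * out_ch * k * k

def bottleneck_params(in_ch, mid_ch, out_ch):
    """1x1 reduce -> 3x3 -> 1x1 expand, as in the convolution groups above."""
    return (conv_params(in_ch, mid_ch, 1)
            + conv_params(mid_ch, mid_ch, 3)
            + conv_params(mid_ch, out_ch, 1))

# Illustrative: a 256 -> 64 -> 256 bottleneck vs a plain 3x3 at full width.
plain = conv_params(256, 256, 3)              # 589,824 weights
bottleneck = bottleneck_params(256, 64, 256)  # 69,632 weights
```

The bottleneck reaches the same input/output width with roughly an eighth of the weights, which is why deep ResNet-style backbones stack this block.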
The encoder module is a stack of 6 identical Transformer encoders. Each encoder has two sublayers: the first is a multi-head self-attention layer and the second is a position-wise feed-forward network layer; each sublayer uses a residual connection. The encoder adds the serialized feature map to the positional encoding to obtain the query Q and key value K; after the multi-head self-attention layer, the result is added to the feature map and normalized, then passed through the feed-forward network to obtain the output of a single encoder, which serves as the input to the next. After 6 identical encoder structures, the output of the encoder part is obtained.
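The self-attention computation inside each encoder can be sketched in plain Python for a single head (the token count, feature width, and values below are illustrative assumptions, not the patent's dimensions):

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x d) by b (d x m), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax_row(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]              # transpose of K
    scores = matmul(q, k_t)
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax_row(row) for row in scores]    # rows sum to 1
    return matmul(weights, v)

# Two tokens with 2-dimensional features (illustrative values).
q = k = v = [[1.0, 0.0], [0.0, 1.0]]
out = self_attention(q, k, v)   # each row is a convex mix of the value rows
```

Each output row attends most strongly to the key it aligns with; the multi-head version simply runs several such maps in parallel on projected subspaces and concatenates the results.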
The input of the decoder consists of three parts: the encoder output, the positional encoding, and the queries. The queries have dimensions (300, 4); the first dimension is the number of predefined target queries and the second is the query's hidden dimension. The decoder first computes multi-head self-attention in the same way as the encoder, then computes cross-attention with the encoder output; after 6 identical decoder structures, the output of the decoder part is obtained.
The prediction output module comprises a feed-forward neural network and fully connected layers. The feed-forward network is divided into two parts: one predicts the class and the other the position. The class branch consists of a linear layer with hidden dimension 512; since there is a background (empty) class, the output dimension is the number of classes plus 1. The position branch consists of 3 linear layers with hidden dimension 512. Both branches pass through a sigmoid activation function.
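The output dimensions of the two branches follow directly from this description. A sketch of the bookkeeping (the class count is an illustrative assumption; the patent's task would have its damage classes plus background):

```python
def head_output_dims(num_classes, num_queries=300, hidden=512):
    """Per-query output shapes of the two prediction branches.
    Class branch: one linear layer to num_classes + 1 (background class).
    Box branch: 3 linear layers ending in 4 box coordinates."""
    class_out = num_classes + 1          # +1 for the background (empty) class
    box_out = 4                          # box coordinates, sigmoid-normalized
    return {
        "class_logits": (num_queries, class_out),
        "boxes": (num_queries, box_out),
        "hidden": hidden,
    }

dims = head_output_dims(num_classes=1)   # e.g. a single "damage" class
```

With one foreground class the class head therefore outputs 2 logits per query (damage vs background) and the box head 4 normalized coordinates per query.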
Further, the backbone network of the teacher model is ResNet-101, and the backbone network of the student model is a smaller ResNet.
Further, the loss function of the student model in step 4 is a composite loss consisting of a teacher soft-label term and a student hard-label term:

L = α·L_soft + β·L_hard

where α and β are weighting hyperparameters, L_soft is the teacher soft-label loss, and L_hard is the student hard-label loss.
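The composite loss is a weighted sum of the two terms. A one-line numeric sketch (the weights and the loss values fed in are illustrative assumptions, not values from the patent):

```python
def composite_loss(soft_loss, hard_loss, alpha=0.5, beta=0.5):
    """Composite student loss: alpha * L_soft + beta * L_hard."""
    return alpha * soft_loss + beta * hard_loss

# Illustrative values: equal weighting of the two terms.
loss = composite_loss(soft_loss=0.8, hard_loss=1.2)  # 0.5*0.8 + 0.5*1.2 = 1.0
```

Setting beta to 0 recovers pure distillation against the teacher's soft labels; setting alpha to 0 recovers ordinary supervised training on the hard labels.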
Further, positive- and negative-sample anchor boxes are obtained by query sampling, and the selected distillation anchor boxes do not participate in the back-propagation of the student model during distillation training. The losses of the positive-sample, negative-sample, and random-sampling anchor boxes combine as

L_distill = L_pos + L_neg + L_rand

where L_pos is the positive-sample anchor-box distillation loss, L_neg is the negative-sample anchor-box distillation loss, and L_rand is the random-sampling anchor-box distillation loss.
The knowledge-distillation-based steel-floor damage detection method provided by the invention mainly comprises a teacher network model and a student network model; the backbone of the teacher network is ResNet-101, and the backbone of the student network is a smaller ResNet. The method establishes prediction matching between queries through progressive multi-stage knowledge distillation to gradually transfer useful knowledge to the student model, and simultaneously adopts teacher feature distillation to make full use of the teacher's intermediate features, providing additional information for the one-to-one assignment strategy in the student model.
Drawings
FIG. 1 is a schematic flow chart of a method for detecting the damage of the steel floor of the railway wagon based on knowledge distillation.
Fig. 2 is a network configuration diagram and a detailed block diagram of the present invention.
Fig. 3 is a block diagram of a teacher model feature distillation.
FIG. 4 is a comparison of the mAP metric between the improved algorithm of the present invention and the student model algorithm.
Detailed Description
As shown in fig. 1, the method for detecting the damage of the steel floor of the railway wagon based on knowledge distillation comprises the following steps:
and step 1, acquiring images of a plurality of angles at the bottom of the train.
First, whole-car images of the railway freight car in operation are captured by a high-speed camera, covering the side frames, the middle section, and the coupler-buffer region. From these, images containing the steel-floor bottom are selected, and the captured bottom images are inspected.
Step 2, screening the obtained pictures and keeping the images containing the target parts, with a 1:1 ratio of pictures containing steel-floor damage to pictures without damage; annotating the damaged regions, and distinguishing the fault-free samples from the small number of faulty samples;
Step 3, constructing a teacher model and a student model;
The encoder modules of the teacher model and the student model have the same structure: each is an encoder module formed by 6 layers of Transformer encoders.
As shown in fig. 2, the teacher model and the student model each comprise a backbone network module, an encoder module, a decoder module, and a prediction output module. The image passes through the backbone network module, which extracts high-dimensional vector information and sends it to the encoder module; the encoder module semantically encodes the features and passes them to the decoder module; the decoder computes cross-attention between the feature-map key values and the corresponding region features; finally the prediction module outputs the detection result.
The backbone network module comprises an input layer, a first group of convolution layers, a maximum pooling layer, a second group of convolution layers, a third group of convolution layers, a fourth group of convolution layers and a fifth group of convolution layers which are connected in sequence.
The input layer takes image data of size 513x513. The first group of convolution layers consists of one 7x7 convolution operation followed by one nonlinear activation function. The second group comprises 9 convolution layers, one nonlinear activation layer, and an average pooling layer; the third group comprises 12 convolution layers, one nonlinear activation layer, and an average pooling layer; the fourth group comprises 69 convolution layers, one nonlinear activation layer, and an average pooling layer; the fifth group comprises 9 convolution layers, one nonlinear activation layer, and an average pooling layer. Each convolution layer in the second through fifth groups applies, in sequence, a 1x1 convolution, a 3x3 convolution, and a 1x1 convolution.
The encoder module is a stack of 6 identical Transformer encoders. Each encoder has two sublayers: the first is a multi-head self-attention layer and the second is a position-wise feed-forward network layer; each sublayer uses a residual connection. The encoder adds the serialized feature map to the positional encoding to obtain the query Q and key value K; after the multi-head self-attention layer, the result is added to the feature map and normalized, then passed through the feed-forward network to obtain the output of a single encoder, which serves as the input to the next. After 6 identical encoder structures, the output of the encoder part is obtained.
The input of the decoder consists of three parts: the encoder output, the positional encoding, and the queries. The queries have dimensions (300, 4); the first dimension is the number of predefined target queries and the second is the query's hidden dimension. That is, based on the encoder's encoded features, the decoder translates the 300 queries into 300 targets. The decoder first computes multi-head self-attention in the same way as the encoder, then computes cross-attention with the encoder output; after 6 identical decoder structures, the output of the decoder part is obtained.
The prediction output module continues the computation from the features output by the decoder and mainly comprises a feed-forward neural network and a fully connected layer. The feed-forward network is divided into two parts: one predicts the class and the other the position. The class branch consists of a linear layer with hidden dimension 512; since there is a background (empty) class, the output dimension is the number of classes plus 1. The position branch consists of 3 linear layers with hidden dimension 512. Both branches pass through a sigmoid activation function.
Step 4, as shown in fig. 3: train the teacher model and the student model in parallel using the train-bottom images obtained in step 1, and transfer the teacher's knowledge to the student via progressive multi-stage knowledge distillation and teacher feature distillation. The forward-inference results of the teacher and student models are matched one-to-one with the Hungarian algorithm, the dark knowledge of the decoder layers in the teacher model is transmitted to the student model layer by layer, and teacher feature distillation simultaneously provides the student model with the teacher's un-normalized probability information.
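The one-to-one matching between teacher and student predictions can be illustrated with a brute-force version of the assignment problem that the Hungarian algorithm solves; for the handful of queries shown here, enumerating permutations suffices (the cost matrix below is an illustrative assumption, standing in for real pairwise matching costs):

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one assignment of rows to columns.
    Brute force over permutations: fine for tiny matrices, whereas the
    Hungarian algorithm solves the same problem in polynomial time."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

# Rows: teacher queries; columns: student queries; entries: matching cost.
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.7, 0.3],
]
perm, total = best_assignment(cost)   # diagonal match, total cost 0.6
```

In practice the matching cost combines classification and box terms, and the resulting pairs define which teacher query distills into which student query.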
Input data are fed to the teacher model, the dark knowledge of the teacher's decoder layers is transferred to the student model layer by layer, and the student model is trained with the knowledge-distillation loss function to realize distillation. In this implementation, the teacher backbone is ResNet-101 and the student backbone is a smaller ResNet; both are formed by residually connected five-stage convolution layers. The difference is that the student's convolution layers perform only downsampling, use 3x3 convolution kernels, and output a fifth-stage feature depth of 512, whereas the convolution layers of ResNet-101 perform both upsampling and downsampling and output a fifth-stage feature depth of 2048. Both the teacher and the student output the feature maps of the third through fifth stages, reduced to a hidden dimension of 256 before being fed to the encoder. The encoder structures of the teacher and student models are identical: both are Transformer encoder structures.
The query distillation anchor boxes of the decoder section combine the image and the queries. Because a query serves to detect and aggregate the features of particular instances, and its distribution may differ across backbone models, the same number of similar positive-sample (foreground) and negative-sample (background) anchor boxes are selected as distillation anchor boxes. The decoder also has a multi-stage structure, on which progressive multi-stage knowledge distillation better captures the teacher's hidden knowledge. At each decoder layer, the cross-attention is computed and the teacher's attention-weight matrix guides the student model; the attention weights corresponding to the distillation points are fused by weighting, so that the student obtains target features with richer semantic information.
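The balanced selection of foreground, background, and random distillation anchors can be sketched as follows (the confidence scores and the ranking-by-confidence criterion are illustrative assumptions, not the patent's exact selection rule):

```python
import random

def sample_distill_anchors(fg_scores, k, seed=0):
    """Pick k foreground (highest teacher confidence), k background
    (lowest confidence), and k random query indices as distillation anchors."""
    rng = random.Random(seed)
    order = sorted(range(len(fg_scores)), key=lambda i: fg_scores[i])
    negatives = order[:k]        # least confident queries: background anchors
    positives = order[-k:]       # most confident queries: foreground anchors
    remaining = order[k:-k]      # everything else is eligible for random picks
    randoms = rng.sample(remaining, k)
    return positives, negatives, randoms

# Illustrative teacher foreground confidences for 8 queries.
scores = [0.05, 0.9, 0.1, 0.8, 0.2, 0.7, 0.02, 0.6]
pos, neg, rnd = sample_distill_anchors(scores, k=2)
```

The foreground/background pairs steer the student toward the regions the teacher attends to, while the random picks convey the teacher's view of the remaining features; as the text notes, these anchors are excluded from the student's back-propagation.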
The teacher network with the ResNet-101 backbone is trained first, and soft labels are generated at a high temperature on the basis of the trained teacher network. The loss function of the student model is then no longer the hard-label loss alone but a composite loss of teacher soft labels and student hard labels. The teacher soft-label term pushes the student's class probability distribution as close as possible to the teacher's, so that the feature responses of the two models are as close as possible under the squared-error loss; the student hard-label term is computed on the student model's own predictions. The composite loss is the weighted sum of the distillation loss (teacher soft-label part) and the student model loss (student hard-label part): L = α·L_soft + β·L_hard, with α and β as weighting hyperparameters.
Through positive- and negative-sample anchor-box query sampling, the student model can focus on the regions to which the teacher pays most attention, while random sampling conveys the teacher model's overall view of the features. During distillation training, these selected distillation anchor boxes do not participate in the back-propagation of the student model. The losses of the positive-sample, negative-sample, and random-sampling anchor boxes combine as L_distill = L_pos + L_neg + L_rand.
Step 5, acquiring an image to be detected, inputting it to the fault detection model after processing, and computing an anomaly score to obtain the steel-floor damage detection result.
The test results of the invention are shown in fig. 4.
The foregoing describes only specific embodiments of the invention. Any feature disclosed in this specification may be replaced by an alternative or equivalent feature serving a similar purpose, and all disclosed features, or all steps of a method or process, may be combined in any manner except where mutually exclusive.

Claims (6)

The input layer takes image data of size 513x513. The first group of convolution layers consists of one 7x7 convolution operation followed by one nonlinear activation function. The second group comprises 9 convolution layers, one nonlinear activation layer, and an average pooling layer; the third group comprises 12 convolution layers, one nonlinear activation layer, and an average pooling layer; the fourth group comprises 69 convolution layers, one nonlinear activation layer, and an average pooling layer; the fifth group comprises 9 convolution layers, one nonlinear activation layer, and an average pooling layer. Each convolution layer in the second through fifth groups applies, in sequence, a 1x1 convolution, a 3x3 convolution, and a 1x1 convolution.
The encoder module is a stack of 6 identical Transformer encoders, each with two sublayers: the first is a multi-head self-attention layer and the second is a position-wise feed-forward network layer, and each sublayer uses a residual connection. The encoder adds the serialized feature map to the positional encoding to obtain the query Q and key value K; after the multi-head self-attention layer, the result is added to the feature map and normalized, then passed through the feed-forward network to obtain the output of a single encoder, which serves as the input to the next; after 6 identical encoder structures, the output of the encoder part is obtained;
CN202310399454.XA | 2023-04-14 | A damage detection method for railway freight car steel floor based on knowledge distillation | Active | CN116703819B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310399454.XA | 2023-04-14 | 2023-04-14 | A damage detection method for railway freight car steel floor based on knowledge distillation


Publications (2)

Publication Number | Publication Date
CN116703819A | 2023-09-05
CN116703819B | 2025-07-15

Family

ID=87828241

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202310399454.XA | Active | CN116703819B (en) | 2023-04-14 | 2023-04-14 | A damage detection method for railway freight car steel floor based on knowledge distillation

Country Status (1)

Country | Link
CN | CN116703819B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication Number | Priority Date | Publication Date | Assignee | Title
CN117315392A * | 2023-09-26 | 2023-12-29 | 中国科学技术大学 | Knowledge distillation method universal for DETR detector
CN118072227B * | 2024-04-17 | 2024-07-05 | 西北工业大学太仓长三角研究院 | Rail transit train speed measuring method based on knowledge distillation
CN119251705B * | 2024-12-06 | 2025-03-04 | 厦门理工学院 | Remote sensing image road extraction method and device based on knowledge distillation and readable storage medium

Citations (2)

Publication Number | Priority Date | Publication Date | Assignee | Title
CN111652227A * | 2020-05-21 | 2020-09-11 | 哈尔滨市科佳通用机电股份有限公司 | Fault detection method for broken floor at the bottom of railway freight cars
CN111767711A * | 2020-09-02 | 2020-10-13 | 之江实验室 | Compression method and platform of pre-trained language model based on knowledge distillation

Family Cites Families (3)

Publication Number | Priority Date | Publication Date | Assignee | Title
US20220076136A1 * | 2020-09-09 | 2022-03-10 | Peyman PASSBAN | Method and system for training a neural network model using knowledge distillation
CN114936605A * | 2022-06-09 | 2022-08-23 | 五邑大学 | A neural network training method, equipment and storage medium based on knowledge distillation
CN115861736B * | 2022-12-14 | 2024-04-26 | 广州科盛隆纸箱包装机械有限公司 | High-speed corrugated case printing defect detection method, system and storage medium based on knowledge distillation


Also Published As

Publication Number | Publication Date
CN116703819A | 2023-09-05


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
