CN115205667A - A Dense Object Detection Method Based on YOLOv5s - Google Patents

A Dense Object Detection Method Based on YOLOv5s

Info

Publication number
CN115205667A
CN115205667A (Application CN202210920891.7A)
Authority
CN
China
Prior art keywords
module
convolution
training
fish
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210920891.7A
Other languages
Chinese (zh)
Inventor
宋雪桦
顾寅武
张舜尧
王昌达
金华
袁昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202210920891.7A
Publication of CN115205667A
Legal status: Pending

Abstract

The invention relates to a dense target detection method based on YOLOv5s. A spatial attention mechanism and a channel attention mechanism are added to different branches of the CSP module; a RepVGG Block module is used in the Backbone to improve recognition accuracy for targets of different scales and to increase inference speed; an SA attention module is added to improve the feature extraction capability of the algorithm; CARAFE upsampling is used in the Neck to obtain a larger receptive field; and a Varifocal Loss function is introduced so that training on dense target samples pays more attention to high-quality positive samples. The method trains on a fish dataset and uses the trained model weights for detection, effectively reducing the consumption of manpower and material resources, improving detection accuracy, and better meeting the requirements of dense target detection tasks.

Description

Translated from Chinese
A Dense Object Detection Method Based on YOLOv5s

Technical Field

The present invention relates to the field of computer vision object detection, and in particular to a dense object detection method based on YOLOv5s.

Background Art

Visual object detection aims to locate and recognize the objects present in an image. It is one of the classic tasks in computer vision, is the premise and foundation of many other computer vision tasks, and has important theoretical significance and practical application value in fields such as autonomous driving, video surveillance, aquaculture and smart agriculture. With the rapid development of deep learning, object detection has made great progress. Earlier manual inspection methods had poor accuracy and low efficiency and were time-consuming and labor-intensive. With the continued development of image processing technology, traditional machine learning performed classification and recognition with support vector machines, but the accuracy of that approach is limited and it is prone to missed and false detections. In recent years, for scenes containing dense targets in many fields, detection based on computer vision combined with deep learning has gradually become mainstream; object detection and recognition algorithms automatically extract target features through convolutional neural networks and, compared with previous methods, achieve faster detection speed and higher detection accuracy.

Summary of the Invention

In view of the above problems, a dense target detection model based on YOLOv5s is proposed. The model can better meet the needs of dense target detection tasks.

To achieve the above purpose, the technical solution adopted by the present invention is as follows: a dense target detection method based on YOLOv5s, comprising the following steps:

1) A detection device is placed at the front end of a bait-casting boat to detect the number of fish in a school. The detection device comprises a camera device and a lighting device; the camera device photographs the fish school for counting; the lighting device stays on continuously to provide underwater illumination;

2) Construct a fish dataset D2 and divide it into a training set Dtrain and a validation set Dtest;

3) Construct the YOLOv5s network model, which comprises Input, Backbone, Neck and Prediction. The Input includes Mosaic data augmentation, adaptive anchor box calculation and adaptive image scaling; the Backbone includes the Focus module, SPP module and C3 module; the neck network Neck includes the FPN module, PAN module and C3 module; the Prediction includes the bounding box loss function and NMS;

4) Modify the backbone convolution module, replacing it with the RepVGG Block module;

5) Modify the backbone structure, inserting the SA attention mechanism between the RepVGG module and the SPP module;

6) Modify the upsampling method of the YOLOv5s neck network, replacing nearest-neighbor upsampling with CARAFE upsampling;

7) Replace Focal Loss, the loss function used to evaluate the class loss and confidence loss of target boxes and predicted boxes, with the Varifocal Loss function;

8) Perform transfer training on the fish dataset D2 to obtain the training weights w: using GIoU_Loss as the loss function, stop training when the model loss curve approaches 0 with no obvious fluctuation and obtain the training weights w; otherwise continue training;

9) Input images for fish school detection: feed the captured fish school images into the model with training weights w, and the model automatically counts the fish according to the weights.

Further, the above step 2) comprises the following steps:

2.1) Select N images from a public fish dataset to construct dataset D1;

2.2) Use the labeling tool Labelimg to annotate the fish in each image of dataset D1 to construct the fish dataset D2;

2.3) Divide the fish dataset D2 proportionally into a training set Dtrain and a validation set Dtest (a code sketch of this split follows).
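
For illustration, the following is a minimal Python sketch of the proportional split in step 2.3), assuming a Labelimg/YOLO-format dataset stored as images/ and labels/ directories and an 8:2 ratio; the directory layout, file extension and ratio are assumptions and are not specified by the invention.

import random
import shutil
from pathlib import Path

def split_dataset(d2_dir="D2", out_dir="split", train_ratio=0.8, seed=0):
    """Split a Labelimg/YOLO-format dataset D2 into Dtrain and Dtest by a fixed ratio."""
    images = sorted(Path(d2_dir, "images").glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * train_ratio)
    subsets = {"Dtrain": images[:n_train], "Dtest": images[n_train:]}
    for name, files in subsets.items():
        for sub in ("images", "labels"):
            Path(out_dir, name, sub).mkdir(parents=True, exist_ok=True)
        for img in files:
            label = Path(d2_dir, "labels", img.stem + ".txt")  # YOLO-format .txt written by Labelimg
            shutil.copy(img, Path(out_dir, name, "images", img.name))
            if label.exists():
                shutil.copy(label, Path(out_dir, name, "labels", label.name))

if __name__ == "__main__":
    split_dataset()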

Further, the above step 4) comprises the following steps:

4.1) Train the multi-branch model: during training, add a parallel 1×1 convolution branch and an identity mapping branch to each 3×3 convolution layer;

4.2) Equivalently convert the multi-branch model into a single-path model: a 1×1 convolution can be regarded as a 3×3 convolution whose kernel is padded with zeros, and the identity mapping is a special 1×1 convolution; by the additivity of convolution, the three branches of each RepVGG Block module can be merged into a single 3×3 convolution;

4.3) Structural re-parameterization: transfer the weights of the multi-branch network into the simple network according to the actual data flow (a code sketch of this re-parameterization follows).
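
For illustration, the following is a minimal PyTorch sketch of the branch merging in steps 4.2) and 4.3): the 1×1 branch and the identity branch are folded into a single 3×3 kernel and the equivalence is checked numerically. BatchNorm fusion, which a complete RepVGG re-parameterization also performs, is omitted to keep the example short; all names and sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def merge_repvgg_branches(w3x3, w1x1, channels):
    """Return one 3x3 kernel equivalent to parallel 3x3 + 1x1 + identity branches."""
    # Pad the 1x1 kernel to 3x3; its weight sits at the kernel centre.
    w1x1_as_3x3 = F.pad(w1x1, [1, 1, 1, 1])
    # The identity branch is a 3x3 kernel with a 1 at the centre of each channel's own map.
    w_id = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    return w3x3 + w1x1_as_3x3 + w_id  # additivity of convolution

# Verify that the merged kernel reproduces the three-branch output.
channels = 8
conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
conv1 = nn.Conv2d(channels, channels, 1, bias=False)
x = torch.randn(1, channels, 16, 16)
y_multi = conv3(x) + conv1(x) + x
w_merged = merge_repvgg_branches(conv3.weight, conv1.weight, channels)
y_single = F.conv2d(x, w_merged, padding=1)
print(torch.allclose(y_multi, y_single, atol=1e-5))  # True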

Further, the above step 5) comprises the following steps:

5.1) Feature grouping: let the input feature be X ∈ R^(C×H×W), where C, H and W denote the number of channels, the height and the width respectively; feature grouping splits the input X into g groups along the channel dimension, so that each sub-feature gradually captures a specific semantic response during training;

5.2) Use the channel attention mechanism to capture channel correlation information; the calculation formulas are as follows:

s = Fgp(Xk1) = (1/(H×W)) Σi Σj Xk1(i, j), summed over all H×W spatial positions

X′k1 = σ(W1·s + b1) · Xk1

where s denotes the channel statistic, Xk1 is one of the branches obtained by splitting along the channel dimension, X′k1 denotes the final output of the channel attention, σ is the sigmoid activation function, and W1 and b1 are parameters of shape C/2G × 1 × 1.

5.3) Use the spatial attention mechanism to capture spatial correlation information; the calculation formula is as follows:

X′k2 = σ(W2 · GN(Xk2) + b2) · Xk2

where Xk2 is the other branch obtained by splitting along the channel dimension, X′k2 denotes the final output of the spatial attention, W2 and b2 are parameters of shape C/2G × 1 × 1, and GN denotes the group normalization method;

5.4) Aggregation: after the channel attention and spatial attention have been computed, the two attention outputs are integrated and fused by Concat to obtain X′k = [X′k1, X′k2] ∈ R^(C/2G×H×W); a channel shuffle operation is then used for inter-group communication (a code sketch of the SA module follows).
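
For illustration, the following is a simplified PyTorch sketch of one SA unit as described in steps 5.1) to 5.4); the group count, parameter shapes and initial values are assumptions used only to show the structure of the module.

import torch
import torch.nn as nn

class SAUnit(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)                       # C/2G channels per branch
        self.w1 = nn.Parameter(torch.zeros(1, c, 1, 1))    # channel-attention scale W1
        self.b1 = nn.Parameter(torch.ones(1, c, 1, 1))     # channel-attention bias b1
        self.w2 = nn.Parameter(torch.zeros(1, c, 1, 1))    # spatial-attention scale W2
        self.b2 = nn.Parameter(torch.ones(1, c, 1, 1))     # spatial-attention bias b2
        self.gn = nn.GroupNorm(c, c)                       # group normalization GN

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.reshape(b * self.groups, c // self.groups, h, w)
        xk1, xk2 = x.chunk(2, dim=1)                       # 5.1) split along the channel dimension
        s = xk1.mean(dim=(2, 3), keepdim=True)             # 5.2) channel statistic via global average pooling
        xk1 = torch.sigmoid(self.w1 * s + self.b1) * xk1
        xk2 = torch.sigmoid(self.w2 * self.gn(xk2) + self.b2) * xk2  # 5.3) spatial attention
        out = torch.cat([xk1, xk2], dim=1).reshape(b, c, h, w)       # 5.4) Concat
        out = out.reshape(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)  # channel shuffle
        return out

# Usage: SAUnit(64)(torch.randn(2, 64, 32, 32)).shape -> torch.Size([2, 64, 32, 32])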

Further, the above step 6) comprises the following steps:

6.1) Feature map channel compression: let the upsampling ratio be σ; for an input feature map of shape C×H×W, where C, H and W denote the number of channels, the height and the width respectively, a 1×1 convolution compresses its channel count to Cm;

6.2) Content encoding and upsampling kernel prediction: for the input feature map compressed in step 6.1), a convolution layer of kernel size kencoder × kencoder is used to predict the upsampling kernels; assuming the upsampling kernel size is kup × kup, the number of input channels is Cm and the number of output channels is σ²·kup²; the channel dimension is then unfolded into the spatial dimension, giving upsampling kernels of shape σH × σW × kup²;

6.3) Upsampling kernel normalization: each kup × kup channel slice of the upsampling kernels obtained in step 6.2) is normalized with softmax so that the kernel weights sum to 1; for each position in the output feature map, it is mapped back to the input feature map, the kup × kup region centred on it is taken out, and its dot product with the predicted upsampling kernel at that point gives the output value; different channels at the same position share the same upsampling kernel (a code sketch of CARAFE follows).
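
For illustration, the following is a simplified PyTorch sketch of the CARAFE upsampler described in steps 6.1) to 6.3); the compressed channel count Cm, the kernel sizes and the parameter names are assumptions, and the reassembly is written for clarity rather than efficiency.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    def __init__(self, c, c_m=64, scale=2, k_up=5, k_encoder=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(c, c_m, 1)                 # 6.1) compress channels to Cm
        self.encoder = nn.Conv2d(c_m, (scale * k_up) ** 2,   # 6.2) predict sigma^2 * kup^2 kernel channels
                                 k_encoder, padding=k_encoder // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        kernels = self.encoder(self.compress(x))              # (b, s^2*k^2, h, w)
        kernels = F.pixel_shuffle(kernels, s)                 # unfold channels into space: (b, k^2, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)                   # 6.3) each kup x kup kernel sums to 1
        patches = F.unfold(x, k, padding=k // 2)              # kup x kup neighbourhood of every input position
        patches = patches.view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")  # map each output back to its source
        patches = patches.view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)    # dot product; channels share the same kernel

# Usage: CARAFE(256)(torch.randn(1, 256, 20, 20)).shape -> torch.Size([1, 256, 40, 40])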

Further, in the above step 7), the Varifocal Loss function is defined as follows:

VFL(p, q) = −q · (q·log(p) + (1 − q)·log(1 − p)),   if q > 0
VFL(p, q) = −α·p^γ·log(1 − p),                      if q = 0

where p is the predicted IACS and q is the target IoU score; for positive samples, q is the IoU between the predicted bounding box and the ground-truth box, and for negative samples, q is 0 (a code sketch follows).
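
For illustration, a minimal PyTorch sketch of the Varifocal Loss above follows; the α and γ defaults and the mean reduction are assumptions.

import torch

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-8):
    """p: predicted IACS in (0, 1); q: target IoU score, 0 for negative samples."""
    p = p.clamp(eps, 1.0 - eps)
    pos_loss = -q * (q * torch.log(p) + (1 - q) * torch.log(1 - p))  # q-weighted BCE for positives
    neg_loss = -alpha * p.pow(gamma) * torch.log(1 - p)              # focally down-weighted negatives
    return torch.where(q > 0, pos_loss, neg_loss).mean()

# Usage:
# p = torch.sigmoid(torch.randn(6)); q = torch.tensor([0.9, 0.6, 0.0, 0.0, 0.3, 0.0])
# print(varifocal_loss(p, q))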

Further, in the above step 8), the GIoU_Loss function is given by:

GIoU_Loss = 1 − GIoU = 1 − (IoU − (Ac − U)/Ac)

where:

IoU = I / U,   U = Ap + Ag − I

IoU denotes the intersection-over-union of the two overlapping rectangular boxes; I denotes the overlapping area of the two rectangles; U is the sum of the two rectangle areas, Ap + Ag, minus their intersection area I; and Ac is the area of the smallest enclosing rectangle of the two boxes (a code sketch follows).
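
For illustration, a minimal PyTorch sketch of GIoU_Loss for axis-aligned boxes follows; the (x1, y1, x2, y2) box format and the mean reduction are assumptions.

import torch

def giou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) tensors of boxes as (x1, y1, x2, y2)."""
    ap = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])          # Ap
    ag = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])  # Ag
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)             # intersection area I
    union = ap + ag - inter                                             # U = Ap + Ag - I
    iou = inter / (union + eps)                                         # IoU = I / U
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    ac = (ex2 - ex1) * (ey2 - ey1)                                      # smallest enclosing area Ac
    giou = iou - (ac - union) / (ac + eps)                              # GIoU = IoU - (Ac - U)/Ac
    return (1.0 - giou).mean()                                          # GIoU_Loss = 1 - GIoU

# Usage: giou_loss(torch.tensor([[0., 0., 2., 2.]]), torch.tensor([[1., 1., 3., 3.]]))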

The present invention provides a dense target detection method based on YOLOv5s, using a detection model that integrates the RepVGG module, the attention mechanism and the CARAFE upsampling module. The method effectively improves overall performance in dense target image detection tasks and greatly improves detection accuracy, which is of great significance to the development of autonomous driving, video surveillance and aquaculture.

Brief Description of the Drawings

FIG. 1 is a flowchart of the dense target detection method based on YOLOv5s of the present invention.

FIG. 2 is a structural diagram of the YOLOv5s network of the present invention.

FIG. 3 is a structural diagram of the RepVGG Block module of the backbone network of the present invention.

FIG. 4 is a structural diagram of the SA attention mechanism of the present invention.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be noted that the technical solution and design principle of the present invention are described in detail below using only one preferred technical solution, but the protection scope of the present invention is not limited thereto.

The described embodiment is a preferred embodiment of the present invention, but the present invention is not limited to the above embodiment; any obvious improvement, replacement or modification that can be made by those skilled in the art without departing from the essence of the present invention falls within the protection scope of the present invention.

The flow of the dense target detection method based on YOLOv5s provided by the present invention is shown in FIG. 1 and comprises the following steps:

1) A detection device is placed at the front end of a bait-casting boat to detect the number of fish in a school. The detection device comprises a camera device and a lighting device; the camera device photographs the fish school for counting; the lighting device stays on continuously to provide underwater illumination;

2) Construct a fish dataset D2 and divide it into a training set Dtrain and a validation set Dtest;

3) Construct the YOLOv5s network model; the YOLOv5s network structure is shown in FIG. 2. The YOLOv5s network model comprises Input, Backbone, Neck and Prediction. The Input includes Mosaic data augmentation, adaptive anchor box calculation and adaptive image scaling; the Backbone includes the Focus module, SPP module and C3 module; the neck network includes the FPN module, PAN module and C3 module; the Prediction includes the bounding box loss function and NMS. The C3 module of the backbone network is split into two branches: one branch passes through a stack of several Bottleneck blocks and 3 standard convolution layers, the other branch passes through only one basic convolution module, and the outputs of the two branches are finally concatenated (a code sketch of the C3 module follows);
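
For illustration, the following is a simplified PyTorch sketch of the C3 module just described; the channel split ratio, the number of Bottleneck blocks and the SiLU activation are assumptions used only to show the two-branch structure.

import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 1)
        self.cv2 = ConvBNSiLU(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))  # residual connection

class C3(nn.Module):
    def __init__(self, c_in, c_out, n=3):
        super().__init__()
        c_h = c_out // 2
        self.branch1 = nn.Sequential(ConvBNSiLU(c_in, c_h, 1),
                                     *[Bottleneck(c_h) for _ in range(n)])  # Bottleneck stack branch
        self.branch2 = ConvBNSiLU(c_in, c_h, 1)                             # basic convolution branch
        self.fuse = ConvBNSiLU(2 * c_h, c_out, 1)                           # standard conv after concat

    def forward(self, x):
        return self.fuse(torch.cat([self.branch1(x), self.branch2(x)], dim=1))

# Usage: C3(64, 128)(torch.randn(1, 64, 32, 32)).shape -> torch.Size([1, 128, 32, 32])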

4) Modify the backbone convolution module, replacing it with the RepVGG Block module;

5) Modify the backbone structure, inserting the SA attention mechanism between the RepVGG module and the SPP module;

6) Modify the upsampling method of the YOLOv5s neck network, replacing nearest-neighbor upsampling with CARAFE upsampling;

7) Replace Focal Loss, the loss function used to evaluate the class loss and confidence loss of target boxes and predicted boxes, with Varifocal Loss;

8) Perform transfer training on the fish dataset D2 to obtain the training weights w: using GIoU_Loss as the loss function, stop training when the model loss curve approaches 0 with no obvious fluctuation and obtain the training weights w; otherwise continue training;

9) Input images for fish detection: feed the captured fish images into the model with training weights w, and the model automatically counts the fish according to the weights.

As a preferred embodiment of the present invention, step 2) comprises the following steps:

2.1) Select N images from a public fish dataset to construct dataset D1;

2.2) Use the labeling tool Labelimg to annotate the fish in each image of dataset D1 to construct the fish dataset D2;

2.3) Divide the fish dataset D2 proportionally into a training set Dtrain and a validation set Dtest.

As a preferred embodiment of the present invention, the RepVGG Block convolution structure is shown in FIG. 3, and the above step 4) comprises the following steps:

4.1) Train the multi-branch model. During training, add a parallel 1×1 convolution branch and an identity mapping branch to each 3×3 convolution layer.

4.2) Equivalently convert the multi-branch model into a single-path model. A 1×1 convolution can be regarded as a 3×3 convolution whose kernel is padded with zeros, and the identity mapping is a special 1×1 convolution. By the additivity of convolution, the three branches of each RepVGG Block module can be merged into a single 3×3 convolution.

4.3) Structural re-parameterization. Transfer the weights of the multi-branch network into the simple network according to the actual data flow.

As a preferred embodiment of the present invention, the structure of the SA module is shown in FIG. 4, and the above step 5) comprises the following steps:

5.1) Feature grouping: let the input feature be X ∈ R^(C×H×W), where C, H and W denote the number of channels, the height and the width respectively; feature grouping splits the input X into g groups along the channel dimension, so that each sub-feature gradually captures a specific semantic response during training;

5.2) Use the channel attention mechanism to capture channel correlation information; the calculation formulas are as follows:

s = Fgp(Xk1) = (1/(H×W)) Σi Σj Xk1(i, j), summed over all H×W spatial positions

X′k1 = σ(W1·s + b1) · Xk1

where s denotes the channel statistic, Xk1 is one of the branches obtained by splitting along the channel dimension, X′k1 denotes the final output of the channel attention, σ is the sigmoid activation function, and W1 and b1 are parameters of shape C/2G × 1 × 1.

5.3) Use the spatial attention mechanism to capture spatial correlation information; the calculation formula is as follows:

X′k2 = σ(W2 · GN(Xk2) + b2) · Xk2

where Xk2 denotes the other branch obtained by splitting along the channel dimension, X′k2 denotes the final output of the spatial attention, W2 and b2 are parameters of shape C/2G × 1 × 1, and GN denotes the group normalization method.

5.4) Aggregation. After the two attention calculations above have been completed, their outputs are integrated: first they are fused by a simple Concat to obtain X′k = [X′k1, X′k2] ∈ R^(C/2G×H×W); finally, a channel shuffle operation is used for inter-group communication.

As a preferred embodiment of the present invention, the above step 6) comprises the following steps:

6.1) Feature map channel compression: let the upsampling ratio be σ; for an input feature map of shape C×H×W, where C, H and W denote the number of channels, the height and the width respectively, a 1×1 convolution compresses its channel count to Cm, reducing the amount of computation in the subsequent steps.

6.2) Content encoding and upsampling kernel prediction: for the input feature map compressed in the first step, a convolution layer of kernel size kencoder × kencoder is used to predict the upsampling kernels; assuming the upsampling kernel size is kup × kup, the number of input channels is Cm and the number of output channels is σ²·kup²; the channel dimension is then unfolded into the spatial dimension, giving upsampling kernels of shape σH × σW × kup².

6.3) Upsampling kernel normalization: each kup × kup channel slice of the obtained upsampling kernels is normalized with softmax so that the kernel weights sum to 1. For each position in the output feature map, it is mapped back to the input feature map, the kup × kup region centred on it is taken out, and its dot product with the predicted upsampling kernel at that point gives the output value. Different channels at the same position share the same upsampling kernel.

As a preferred embodiment of the present invention, the Varifocal Loss function in the above step 7) is as follows:

VFL(p, q) = −q · (q·log(p) + (1 − q)·log(1 − p)),   if q > 0
VFL(p, q) = −α·p^γ·log(1 − p),                      if q = 0

where p is the predicted IACS and q is the target IoU score; for positive samples, q is the IoU between the predicted bounding box and the ground-truth box, and for negative samples, q is 0.

As a preferred embodiment of the present invention, the GIoU_Loss function in step 8) is given by:

GIoU_Loss = 1 − GIoU = 1 − (IoU − (Ac − U)/Ac)

where:

IoU = I / U,   U = Ap + Ag − I

IoU denotes the intersection-over-union of the two overlapping rectangular boxes; I denotes the overlapping area of the two rectangles; U is the sum of the two rectangle areas, Ap + Ag, minus their intersection area I; and Ac is the area of the smallest enclosing rectangle of the two boxes.

Claims (7)

1. A dense target detection method based on YOLOv5s is characterized by comprising the following steps:
1) Placing a detection device at the front end of a bait casting boat to detect the number of fish schools, wherein the detection device comprises a camera device and an illuminating device; the camera device is used for shooting the fish school for quantity detection; the lighting device is kept normally on for underwater lighting;
2) Constructing a fish dataset D2, and dividing it into a training set Dtrain and a validation set Dtest;
3) Constructing a YOLOv5s network model, wherein the YOLOv5s network model comprises Input, Backbone, Neck and Prediction; the Input comprises Mosaic data enhancement, adaptive anchor box calculation and adaptive picture scaling; the Backbone comprises a Focus module, an SPP module and a C3 module; the Neck comprises an FPN module, a PAN module and a C3 module; the Prediction comprises a Bounding box loss function and NMS;
4) Modifying the backbone network convolution module, and modifying the backbone network convolution module into a RepVGG Block module;
5) Modifying a backbone network structure, and inserting an SA attention mechanism between the RepVGG module and the SPP module;
6) Modifying an upsampling mode of a Yolov5s neck network, and changing nearest upsampling into a CARAFE upsampling mode;
7) Modifying the loss function Focal Loss, which evaluates the class loss and confidence loss of the target boxes and the predicted boxes, into a Varifocal Loss function;
8) Carrying out transfer training on the fish dataset D2 to obtain a training weight w: using GIoU_Loss as the loss function, stopping training when the model loss curve approaches 0 and has no obvious fluctuation, and obtaining the training weight w; otherwise, continuing training;
9) Inputting images, detecting fish shoals, inputting the obtained fish shoal images into a model with a training weight of w, and automatically identifying the number of the fish shoals by the model according to the weight.
2. The YOLOv5s-based dense target detection method of claim 1, wherein the step 2) comprises the steps of:
2.1) N images are selected from a public fish dataset to construct a dataset D1;
2.2) The labeling tool Labelimg is used to label the fish in each image in the dataset D1 to construct a fish dataset D2;
2.3) The fish dataset D2 is proportionally divided into a training set Dtrain and a validation set Dtest.
3. The YOLOv5s-based dense target detection method of claim 1, wherein the step 4) comprises the steps of:
4.1) Training the multi-branch model: during training, adding a parallel 1×1 convolution branch and an identity mapping branch for each 3×3 convolution layer;
4.2) Equivalently transforming the multi-branch model into a single-path model: regarding the 1×1 convolution as a 3×3 convolution with many 0's in the convolution kernel, the identity mapping being a special 1×1 convolution; according to the additivity principle of convolution, the three branches of each RepVGG Block module can be combined into one 3×3 convolution;
4.3) Structural parameter re-parameterization: transferring the weights of the multi-branch network into the simple network through the actual data flow.
4. The YOLOv5s-based dense target detection method of claim 1, wherein the step 5) comprises the steps of:
5.1) Feature grouping: assuming that the input feature is X ∈ R^(C×H×W), wherein C, H and W respectively represent the number of channels, the height and the width, the feature grouping splits the input X into g groups along the channel dimension, so that each sub-feature gradually captures a specific semantic response in the training process;
5.2) Using a channel attention mechanism to capture channel correlation information, the calculation formulas being as follows:
s = Fgp(Xk1) = (1/(H×W)) Σi Σj Xk1(i, j), summed over all H×W spatial positions
X′k1 = σ(W1·s + b1) · Xk1
in the formulas: s denotes the channel statistic, Xk1 is a branch divided in the channel dimension, X′k1 represents the final output of the channel attention, σ is the sigmoid activation function, and W1 and b1 are parameters with a shape of C/2G × 1 × 1;
5.3) Using a spatial attention mechanism to capture spatial correlation information, the calculation formula being as follows:
X′k2 = σ(W2 · GN(Xk2) + b2) · Xk2
in the formula: Xk2 is a branch divided in the channel dimension, X′k2 represents the final output of the spatial attention, W2 and b2 are parameters with a shape of C/2G × 1 × 1, and GN represents the group normalization method;
5.4) Aggregation: after the calculation of the channel attention and the spatial attention is completed, the two kinds of attention are integrated and fused by Concat to obtain X′k = [X′k1, X′k2] ∈ R^(C/2G×H×W), and inter-group communication is performed by a channel permutation operation (channel shuffle).
5. The method for detecting dense targets based on YOLOv5s as claimed in claim 1, wherein the step 6) comprises the following steps:
6.1) Feature map channel compression: assuming that the upsampling ratio is σ, for an input feature map with a shape of C×H×W, wherein C, H and W respectively represent the number of channels, the height and the width, a 1×1 convolution is used to compress the number of channels to Cm;
6.2) Content encoding and upsampling kernel prediction: for the input feature map compressed in step 6.1), a convolution layer with a kernel size of kencoder × kencoder is used to predict the upsampling kernels; assuming the upsampling convolution kernel is kup × kup, the number of input channels is Cm and the number of output channels is σ²·kup²; the channel dimension is unfolded in the spatial dimension to obtain upsampling kernels with a shape of σH × σW × kup²;
6.3) Upsampling kernel normalization: each kup × kup channel of the upsampling kernels obtained in step 6.2) is normalized by softmax so that the sum of the convolution kernel weights is 1; for each position in the output feature map, it is mapped back to the input feature map, the kup × kup region centered on it is taken out, and the dot product with the predicted upsampling kernel of that point gives the output value; different channels at the same position share the same upsampling kernel.
6. The YOLOv5s-based dense target detection method according to claim 1, wherein in the step 7), the Varifocal Loss function formula is as follows:
VFL(p, q) = −q · (q·log(p) + (1 − q)·log(1 − p)) if q > 0;   VFL(p, q) = −α·p^γ·log(1 − p) if q = 0
in the formula: p is the predicted IACS and q is the target IoU score; for positive samples, q is the IoU between the predicted bounding box and the ground-truth box, and for negative samples, q is 0.
7. The YOLOv5s-based dense target detection method according to claim 1, wherein in the step 8), the GIoU_Loss function conversion formula is as follows:
GIoU_Loss = 1 − GIoU = 1 − (IoU − (Ac − U)/Ac)
in the formula:
IoU = I / U,   U = Ap + Ag − I
IoU represents the intersection-over-union between the two overlapping rectangular boxes; I denotes the overlapping area of the two rectangles; U denotes the sum Ap + Ag of the areas of the two rectangles minus their intersection area I; and Ac is the area of the smallest enclosing rectangle of the two boxes.
CN202210920891.7A | 2022-08-02 | 2022-08-02 | A Dense Object Detection Method Based on YOLOv5s | Pending | CN115205667A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210920891.7A | CN115205667A (en) | 2022-08-02 | 2022-08-02 | A Dense Object Detection Method Based on YOLOv5s

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202210920891.7A | CN115205667A (en) | 2022-08-02 | 2022-08-02 | A Dense Object Detection Method Based on YOLOv5s

Publications (1)

Publication Number | Publication Date
CN115205667A | 2022-10-18

Family

ID=83586088

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210920891.7A | Pending | CN115205667A (en) | 2022-08-02 | 2022-08-02 | A Dense Object Detection Method Based on YOLOv5s

Country Status (1)

Country | Link
CN (1) | CN115205667A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116343045A (en) * | 2023-03-30 | 2023-06-27 | 南京理工大学 | Lightweight SAR image ship target detection method based on YOLO v5
CN116958907A (en) * | 2023-09-18 | 2023-10-27 | 四川泓宝润业工程技术有限公司 | Method and system for inspecting surrounding hidden danger targets of gas pipeline
CN116994118A (en) * | 2023-08-16 | 2023-11-03 | 河北科技大学 | Neural network, method and device for target detection
CN117132767A (en) * | 2023-10-23 | 2023-11-28 | 中国铁塔股份有限公司湖北省分公司 | Small target detection method, device, equipment and readable storage medium
CN117237794A (en) * | 2023-09-04 | 2023-12-15 | 淮阴工学院 | A marine life detection method and system based on PNC-YOLOv7
CN117274192A (en) * | 2023-09-20 | 2023-12-22 | 重庆市荣冠科技有限公司 | A pipeline magnetic leakage defect detection method based on improved YOLOv5
CN117496475A (en) * | 2023-12-29 | 2024-02-02 | 武汉科技大学 | A target detection method and system applied to autonomous driving
CN118314451A (en) * | 2024-03-29 | 2024-07-09 | 金陵科技学院 | Unmanned boat feeding method for target fish schools based on temporal network algorithm
CN119810988A (en) * | 2025-03-12 | 2025-04-11 | 南京信息工程大学 | An intelligent security monitoring system based on YOLOv8

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113762081A (en) * | 2021-08-09 | 2021-12-07 | 江苏大学 | Granary pest detection method based on YOLOv5s

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113762081A (en) * | 2021-08-09 | 2021-12-07 | 江苏大学 | Granary pest detection method based on YOLOv5s

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
任克勤,等.: "基于 YOLOv5的试管检测算法", 《现代计算机》, vol. 28, no. 7, 10 April 2022 (2022-04-10), pages 1 - 2*
周裔扬,等.: "基于 YOLOv5 的移动机器人目标检测算法的研究", 《装备制造技术》, 31 August 2021 (2021-08-31), pages 1*
赵兴博,等.: "适用于 FPGA 的轻量实时视频人脸检测", 《现代计算机》, vol. 28, no. 8, 25 April 2022 (2022-04-25), pages 2*
颜小红: "基于深度学习的水下目标检测方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 March 2022 (2022-03-15), pages 3*

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116343045B (en) * | 2023-03-30 | 2024-03-19 | 南京理工大学 | Lightweight SAR image ship target detection method based on YOLO v5
CN116343045A (en) * | 2023-03-30 | 2023-06-27 | 南京理工大学 | Lightweight SAR image ship target detection method based on YOLO v5
CN116994118A (en) * | 2023-08-16 | 2023-11-03 | 河北科技大学 | Neural network, method and device for target detection
CN117237794A (en) * | 2023-09-04 | 2023-12-15 | 淮阴工学院 | A marine life detection method and system based on PNC-YOLOv7
CN116958907A (en) * | 2023-09-18 | 2023-10-27 | 四川泓宝润业工程技术有限公司 | Method and system for inspecting surrounding hidden danger targets of gas pipeline
CN116958907B (en) * | 2023-09-18 | 2023-12-26 | 四川泓宝润业工程技术有限公司 | Method and system for inspecting surrounding hidden danger targets of gas pipeline
CN117274192A (en) * | 2023-09-20 | 2023-12-22 | 重庆市荣冠科技有限公司 | A pipeline magnetic leakage defect detection method based on improved YOLOv5
CN117132767A (en) * | 2023-10-23 | 2023-11-28 | 中国铁塔股份有限公司湖北省分公司 | Small target detection method, device, equipment and readable storage medium
CN117132767B (en) * | 2023-10-23 | 2024-03-19 | 中国铁塔股份有限公司湖北省分公司 | Small target detection method, device, equipment and readable storage medium
CN117496475A (en) * | 2023-12-29 | 2024-02-02 | 武汉科技大学 | A target detection method and system applied to autonomous driving
CN117496475B (en) * | 2023-12-29 | 2024-04-02 | 武汉科技大学 | Target detection method and system applied to automatic driving
CN118314451A (en) * | 2024-03-29 | 2024-07-09 | 金陵科技学院 | Unmanned boat feeding method for target fish schools based on temporal network algorithm
CN119810988A (en) * | 2025-03-12 | 2025-04-11 | 南京信息工程大学 | An intelligent security monitoring system based on YOLOv8

Similar Documents

Publication | Publication Date | Title
CN115205667A (en) | A Dense Object Detection Method Based on YOLOv5s
CN111126472B (en) | An Improved Target Detection Method Based on SSD
CN108805070A (en) | A kind of deep learning pedestrian detection method based on built-in terminal
CN114283469B (en) | Improved YOLOv4-tiny target detection method and system
CN114220035A (en) | Rapid pest detection method based on improved YOLO V4
CN111767927A (en) | A lightweight license plate recognition method and system based on fully convolutional network
CN114202672A (en) | A small object detection method based on attention mechanism
CN110334705A (en) | A Language Recognition Method for Scene Text Images Combining Global and Local Information
CN113435269A (en) | Improved water surface floating object detection and identification method and system based on YOLOv3
CN112348036A (en) | Adaptive Object Detection Method Based on Lightweight Residual Learning and Deconvolution Cascade
CN110348376A (en) | A kind of pedestrian's real-time detection method neural network based
CN115082855A (en) | Pedestrian occlusion detection method based on improved YOLOX algorithm
CN116485860A (en) | Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features
WO2024108857A1 (en) | Deep-learning-based method for small target detection in unmanned aerial vehicle scenario
CN117456167A (en) | Target detection algorithm based on improved YOLOv8s
CN113536896B (en) | Insulator defect detection method, device and storage medium based on improved Faster RCNN
CN113743505A (en) | An improved SSD object detection method based on self-attention and feature fusion
CN116469100A (en) | A method for semantic segmentation of dual-band images based on Transformer
CN114998879B (en) | Fuzzy license plate recognition method based on event camera
CN113505640A (en) | Small-scale pedestrian detection method based on multi-scale feature fusion
CN114743023B (en) | An image detection method of wheat spider based on RetinaNet model
CN115410087A (en) | Transmission line foreign matter detection method based on improved YOLOv4
CN111680705A (en) | MB-SSD Method and MB-SSD Feature Extraction Network for Object Detection
CN116416244A (en) | Crack detection method and system based on deep learning
CN117315752A (en) | Training method, device, equipment and medium for face emotion recognition network model

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
