Pedestrian target detection method based on multi-scale multi-feature neural network
Technical Field
The invention relates to the technical field of pedestrian detection, in particular to a pedestrian target detection method based on a multi-scale multi-feature neural network.
Background
Pedestrian detection techniques aim to identify and locate pedestrians in image or video sources: to determine whether pedestrians are present in an image or video sequence and to mark and display them. At present, pedestrian detection technology is widely applied in fields such as intelligent driving assistance systems, intelligent robotics and intelligent video surveillance, and plays a key role in improving the intelligence level of these systems. Pedestrian detection is not an isolated task; it is closely connected with tasks such as pedestrian tracking and behavior recognition in video analysis. However, complex factors such as the varied postures of pedestrians, occlusion between pedestrians, and occlusion between pedestrians and the background have become key challenges affecting detection accuracy. Therefore, improving the capability to detect small-target pedestrians in complex environments has important research significance and application prospects.
The invention patent of China with the application number of CN202210979010.9 discloses an all-weather-oriented cross-mode self-adaptive fusion pedestrian target detection system and method, which mainly comprise a cross-mode differential information fusion module and a confidence perception self-adaptive fusion module. The cross-modal differential information fusion module is mainly used for carrying out complementary feature enhancement on the feature information of the visible light and infrared modes extracted by the network, enhancing the space information of the differential feature map of the visible light and the infrared modes through global pooling and average pooling operation, acquiring fusion feature vectors of all modes through a full connection layer and a Tanh activation function, and further carrying out feature enhancement expression on the initially extracted visible light and infrared mode features respectively; the confidence perception self-adaptive fusion module fully utilizes the confidence perception expression to carry out self-adaptive weighting on the characteristics among the different enhanced modes, so that the network detector can better select the reliable modes for processing, and the robustness of the detector is improved; and finally, optimizing network model parameters by utilizing the multitasking loss.
The invention patent of China with the application number CN202410103120.8 discloses a pedestrian target detection method, device, equipment and medium based on a monitoring scene, comprising the following steps: step one, acquiring the current video frame of target monitoring in real time; step two, inputting the current video frame into a pre-trained head-localization deep learning network model to perform head localization, obtaining at least one head position corresponding to each frame image in the current video frame; step three, acquiring the pedestrian bounding boxes corresponding to each frame image in the current video frame through an improved target detection algorithm; and step four, optimizing the pedestrian bounding boxes using the pedestrians' head point positions to obtain the pedestrian target detection bounding box result, so that the number and positions of pedestrians corresponding to the current video frame are determined from that result.
The prior art has the following limitations:
(1) When the images input into the model have different scales, the detection accuracy of these methods is low;
(2) When pedestrians in the image are heavily occluded by the background, these methods cannot accurately detect them.
In view of the above, existing target detection techniques face challenges and shortcomings in pedestrian detection, and further research is needed to improve their accuracy, efficiency and applicability.
Disclosure of Invention
In view of the above problems, a first aspect of the present invention proposes a pedestrian target detection method based on a multi-scale multi-feature neural network: first, image data containing pedestrians are collected and preprocessed; secondly, feature maps of different scales are extracted from the processed data through the designed feature pyramid network and fused, enabling the network to learn richer feature information, and the fused feature maps are further refined by convolution operations; finally, a cross entropy loss function and a smooth L1 loss function are introduced, gradients are calculated through the back propagation algorithm to update the model weights, and the final target detection model is obtained through sufficient training and then deployed for use. The method comprises the following steps:
Step 1, pedestrian data acquisition and processing; collecting image data containing pedestrians, preprocessing the original image data, and dividing the preprocessed data into a training set and a testing set;
Step 2, designing a multi-scale multi-feature fusion module; images are processed on multiple scales simultaneously by introducing a feature pyramid network, and feature maps of different scales are fused so that the model can learn richer feature information;
Step 3, designing a pedestrian target detection module; using a target detection model based on a deep convolutional neural network;
Step 4, training the model; the trained model is then tested and verified using the test set, and the final pedestrian target detection model is saved;
Step 5, deploying the model; the model is deployed on a hardware platform to detect pedestrians.
Preferably, the step 1 specifically includes the following steps:
S201, acquiring real images containing pedestrians, annotating the acquired images to form a corresponding label data set, recording the image data containing pedestrians as {x1, x2, …, xn} and the corresponding labels as {y1, y2, …, yn};
S202, preprocessing the image data from step S201, including smoothing the images with Gaussian filtering to improve the quality and clarity of the image data, and performing data enhancement using rotation, translation, and brightness and contrast adjustment.
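The preprocessing of step S202 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the invention's actual implementation: the function names (gaussian_kernel, smooth, augment), kernel size, sigma and augmentation parameters are all illustrative.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized 2-D Gaussian kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def smooth(image, size=5, sigma=1.0):
    """Gaussian smoothing of a 2-D grayscale image by direct convolution
    (edge padding keeps the output the same size as the input)."""
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + size, j:j + size] * k)
    return out

def augment(image, brightness=10.0, contrast=1.2):
    """Simple augmentations: a 90-degree rotation plus a linear
    brightness/contrast adjustment clipped to the 8-bit range."""
    rotated = np.rot90(image)
    adjusted = np.clip(contrast * image + brightness, 0, 255)
    return rotated, adjusted
```

In practice a library routine (e.g. an OpenCV Gaussian blur) would replace the explicit double loop; the sketch only makes the weighted-neighborhood computation visible.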
Preferably, the step 2 specifically includes the following steps:
S301, introducing a feature pyramid structure; the feature pyramid enables the network to process images on multiple scales simultaneously, and each level of the pyramid is fused with the feature map of the adjacent level through top-down upsampling and transverse connection. Assuming Featurel denotes the feature map of the l-th level, the upsampling calculation process is as follows:
Ul=Upsample(Featurel) (1)
Where Ul represents the upsampled feature map and Upsample() represents the upsampling operation;
The feature fusion calculation process of the transverse connection is as follows:
Pl=Featurel+U(l−1) (2)
Wherein Pl is the fused feature map, Featurel is the bottom-up l-th layer feature map, and U(l−1) is the result of upsampling the top-down (l−1)-th layer feature map;
S302, the fused feature maps are further refined by a 3×3 convolution operation, so that each layer of the feature pyramid generates richer and more adaptable feature representations and the performance of the whole network is improved. The calculation process of the convolution operation is as follows:
Rl=Conv3×3(Pl) (3)
Wherein Rl is the final feature map after the 3×3 convolution operation. The refinement can be regarded as a feature mapping process in which the new value of each pixel in the feature map is obtained by a weighted sum of the pixel values in its surrounding neighborhood:
Rij=∑m∑n Kernelm,n×Featurei−m,j−n (4)
Where Rij is the pixel value on the refined feature map, Featurei−m,j−n is the pixel value on the original feature map, Kernelm,n is the weight of the convolution kernel, and m, n denote the position of the element within the convolution kernel.
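The upsampling of Eq. (1), the transverse fusion of Eq. (2), and the 3×3 refinement of Eqs. (3)–(4) can be sketched for a single-channel feature map as follows. This is a minimal NumPy illustration under assumptions not fixed by the text: nearest-neighbour 2× upsampling, zero padding for the convolution, and the function names are all illustrative choices.

```python
import numpy as np

def upsample2x(feature):
    """Eq. (1): nearest-neighbour 2x upsampling of a 2-D feature map."""
    return feature.repeat(2, axis=0).repeat(2, axis=1)

def fuse(lateral, upsampled):
    """Eq. (2): element-wise sum of the bottom-up (lateral) feature map
    and the upsampled top-down feature map."""
    return lateral + upsampled

def refine3x3(p, kernel):
    """Eqs. (3)-(4): 3x3 refinement; each output pixel is the weighted
    sum of its 3x3 neighbourhood (zero padding preserves the shape)."""
    pad = np.pad(p, 1, mode="constant")
    h, w = p.shape
    out = np.zeros(p.shape, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * kernel)
    return out
```

A trained network would use learned multi-channel kernels and a deep-learning framework; the sketch only exposes the per-pixel arithmetic behind Eqs. (1)–(4).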
Preferably, the step 3 specifically includes the following steps:
The YOLOv8 network structure is selected as the main model structure of the pedestrian target detection module, receiving the multi-scale fusion feature maps output by the multi-scale multi-feature fusion module. The calculation process is as follows:
Y=YOLOv8(Pl1,Pl2,...,Pln) (5)
Where Y is the output of YOLOv8 and Pli is a feature map of a different scale.
Preferably, the step 4 specifically includes the following steps:
S501, defining a loss function, selecting a cross entropy loss function as a classification loss function, and selecting a smooth L1 loss function as a bounding box regression loss function;
the cross entropy loss function is calculated as follows:
Lcls=−(1/N)∑i∑c yi,c log(pi,c) (6)
Wherein N is the total number of samples, C is the total number of categories, yi,c is the one-hot encoding of the real label of sample i, and pi,c is the predicted probability that sample i belongs to category c;
the calculation formula of the smooth L1 loss function is as follows:
SmoothL1(x)=0.5x², if |x|<1; |x|−0.5, otherwise (7)
where x is the difference between the predicted value and the true value;
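The two loss functions described above can be sketched as follows. This is a minimal NumPy illustration for a batch of N samples and C classes; the function names and the small eps stabilizer added inside the logarithm are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np

def cross_entropy(y_onehot, p, eps=1e-12):
    """Classification loss: mean over N samples of
    -sum_c y_{i,c} * log(p_{i,c}); eps avoids log(0)."""
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=1))

def smooth_l1(x):
    """Bounding-box regression loss: 0.5*x^2 when |x| < 1,
    |x| - 0.5 otherwise, applied element-wise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)
```

The quadratic region of smooth L1 keeps gradients small near zero error, while the linear region limits the influence of outlier boxes, which is why it is preferred over plain L2 for box regression.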
S502, setting the model termination condition: training is terminated when the training loss remains unchanged for 10 consecutive epochs;
S503, setting the optimizer of the model: Adam is selected as the training optimizer to update the model parameters and improve the convergence speed;
S504, the trained optimal model is saved for subsequent deployment.
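The termination condition of step S502 can be sketched as follows; the function name, the numeric tolerance used to decide that the loss is "unchanged", and the patience default are illustrative assumptions.

```python
def should_stop(loss_history, patience=10, tol=1e-6):
    """Return True when the loss has stayed unchanged (within tol)
    for `patience` consecutive epochs, per step S502."""
    if len(loss_history) < patience + 1:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) < tol
               for i in range(patience))
```

Checked once per epoch on the recorded loss values, this halts training exactly when ten successive epochs show no change.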
Preferably, the step 5 specifically includes the following steps:
S601, converting the trained model into a format compatible with the hardware platform, then loading and initializing it, including operations such as loading the model weights and allocating run-time memory for the model;
S602, inputting the image data to be detected so that the model detects the pedestrians in the image; the calculation process is as follows:
Result=Model(Image) (8)
The Model represents a trained network, the Image represents an input Image to be detected, and the Result represents a pedestrian detection Result.
The second aspect of the present invention also provides a pedestrian target detection device based on a multi-scale multi-feature neural network, characterized in that: the device includes at least one processor and at least one memory, the processor and the memory being coupled; the memory stores a computer program implementing the pedestrian target detection method based on a multi-scale multi-feature neural network according to the first aspect; when the processor executes the computer program stored in the memory, the processor is caused to execute the pedestrian target detection method based on the multi-scale multi-feature neural network.
Compared with the prior art, the invention has the following beneficial effects:
(1) Images are processed on multiple scales simultaneously through the feature pyramid network, and feature maps of different scales are fused, so that the model learns richer feature information and the detection capability for targets of different sizes is improved;
(2) The multi-scale multi-feature fusion technique enhances the model's recognition of pedestrian targets, particularly improving the detection of small-target pedestrians in complex environments and thus the detection precision;
(3) In the data preprocessing stage, Gaussian filtering is adopted for image smoothing and data enhancement is performed by rotation, translation, and brightness and contrast adjustment, further improving the generalization capability of the model; in the model training stage, the cross entropy loss function and the smooth L1 loss function are selected together with the Adam optimizer, improving the training efficiency and convergence speed of the model.
In general, the invention provides a reliable solution for pedestrian target detection and has wide application prospect.
Drawings
Fig. 1 shows the main steps of the present invention.
Fig. 2 is a general flow chart of a pedestrian target detection method based on a multi-scale multi-feature neural network.
Fig. 3 pedestrian detection results.
Fig. 4 is a schematic structural diagram of a pedestrian target detection device based on a multi-scale multi-feature neural network in embodiment 2 of the present invention.
Detailed Description
The invention will be further described with reference to specific examples.
Example 1:
The invention provides a pedestrian target detection method based on a multi-scale multi-feature neural network. The main steps of the invention are shown in fig. 1, and the general flow of the method is shown in fig. 2. The method comprises the following key steps: firstly, image data containing pedestrians are collected and preprocessed; secondly, feature maps of different scales are extracted from the processed data through the designed feature pyramid network and fused, enabling the network to learn richer feature information, and the fused feature maps are further refined by convolution operations; finally, a cross entropy loss function and a smooth L1 loss function are introduced, gradients are calculated through the back propagation algorithm to update the model weights, and the final target detection model is obtained through sufficient training and deployment.
1. Pedestrian data acquisition and processing, including the following steps:
S201, acquiring real images containing pedestrians, annotating the acquired images to form a corresponding label data set, recording the image data containing pedestrians as {x1, x2, …, xn} and the corresponding labels as {y1, y2, …, yn};
S202, preprocessing the image data from step S201, including smoothing the images with Gaussian filtering to improve the quality and clarity of the image data, and performing data enhancement using rotation, translation, and brightness and contrast adjustment.
2. The design of the multi-scale multi-feature fusion module comprises the following steps:
S301, introducing a feature pyramid structure; the feature pyramid enables the network to process images on multiple scales simultaneously, and each level of the pyramid is fused with the feature map of the adjacent level through top-down upsampling and transverse connection. Assuming Featurel denotes the feature map of the l-th level, the upsampling calculation process is as follows:
Ul=Upsample(Featurel) (1)
Where Ul represents the upsampled feature map and Upsample() represents the upsampling operation;
The feature fusion calculation process of the transverse connection is as follows:
Pl=Featurel+U(l−1) (2)
Wherein Pl is the fused feature map, Featurel is the bottom-up l-th layer feature map, and U(l−1) is the result of upsampling the top-down (l−1)-th layer feature map;
S302, the fused feature maps are further refined by a 3×3 convolution operation, so that each layer of the feature pyramid generates richer and more adaptable feature representations and the performance of the whole network is improved. The calculation process of the convolution operation is as follows:
Rl=Conv3×3(Pl) (3)
Wherein Rl is the final feature map after the 3×3 convolution operation. The refinement can be regarded as a feature mapping process in which the new value of each pixel in the feature map is obtained by a weighted sum of the pixel values in its surrounding neighborhood:
Rij=∑m∑n Kernelm,n×Featurei−m,j−n (4)
Where Rij is the pixel value on the refined feature map, Featurei−m,j−n is the pixel value on the original feature map, Kernelm,n is the weight of the convolution kernel, and m, n denote the position of the element within the convolution kernel.
3. The pedestrian target detection module design comprises the following steps:
The YOLOv8 network structure is selected as the main model structure of the pedestrian target detection module, receiving the multi-scale fusion feature maps output by the multi-scale multi-feature fusion module. The calculation process is as follows:
Y=YOLOv8(Pl1,Pl2,...,Pln) (5)
Where Y is the output of YOLOv8 and Pli is a feature map of a different scale.
4. Neural network model training, comprising the steps of:
S501, defining a loss function, selecting a cross entropy loss function as a classification loss function, and selecting a smooth L1 loss function as a bounding box regression loss function;
the cross entropy loss function is calculated as follows:
Lcls=−(1/N)∑i∑c yi,c log(pi,c) (6)
Wherein N is the total number of samples, C is the total number of categories, yi,c is the one-hot encoding of the real label of sample i, and pi,c is the predicted probability that sample i belongs to category c;
the calculation formula of the smooth L1 loss function is as follows:
SmoothL1(x)=0.5x², if |x|<1; |x|−0.5, otherwise (7)
where x is the difference between the predicted value and the true value;
S502, setting the model termination condition: training is terminated when the training loss remains unchanged for 10 consecutive epochs;
S503, setting the optimizer of the model: Adam is selected as the training optimizer to update the model parameters and improve the convergence speed;
S504, the trained optimal model is saved for subsequent deployment.
5. The target detection model deployment comprises the following steps:
S601, converting the trained model into a format compatible with the hardware platform, then loading and initializing it, including operations such as loading the model weights and allocating run-time memory for the model;
S602, inputting the image data to be detected so that the model detects the pedestrians in the image; the calculation process is as follows:
Result=Model(Image) (8)
The Model represents a trained network, the Image represents an input Image to be detected, and the Result represents a pedestrian detection Result.
The following example results were presented for this process:
In order to verify the effectiveness of the pedestrian target detection method based on the multi-scale multi-feature neural network, this embodiment provides the displayable result shown in fig. 3: all pedestrians in the image are framed by white rectangular boxes, and most of each pedestrian's body falls within its box, indicating that the method achieves high pedestrian target detection precision. This embodiment also compares the results against the current target detection models Fast R-CNN, SSD and YOLOv7, using the overall mean average precision (mAP) as the evaluation index. The test results show that the mAP of Fast R-CNN is 83.4, that of SSD is 86.1, and that of YOLOv7 is 86.8, while the mAP of the proposed method is 89.2. The proposed method obtains the highest mAP among all compared methods, showing that it achieves the highest pedestrian target detection precision among existing methods, which further verifies its effectiveness and practicability.
Example 2:
As shown in fig. 4, the present application also provides a pedestrian target detection device based on a multi-scale multi-feature neural network, the device comprising at least one processor and at least one memory, and further comprising a communication interface and an internal bus; the memory stores a computer-implemented program for pedestrian target detection based on a multi-scale multi-feature neural network constructed by the construction method described in embodiment 1; when the processor executes the computer-implemented program stored in the memory, the processor can be caused to execute the pedestrian target detection method based on the multi-scale multi-feature neural network. The internal bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the bus in the drawings of the present application is not limited to only one bus or one type of bus. The memory may include high-speed RAM, and may further include nonvolatile memory (NVM) such as at least one magnetic disk memory, and may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The device may be provided as a terminal, server or other form of device.
Fig. 4 is a block diagram of an apparatus shown for illustration. The device may include one or more of the following components: a processing component, a memory, a power component, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component. The processing component generally controls overall operation of the electronic device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component may include one or more processors to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component may include one or more modules that facilitate interactions between the processing component and other components. For example, the processing component may include a multimedia module to facilitate interaction between the multimedia component and the processing component.
The memory is configured to store various types of data to support operations at the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication component is configured to facilitate communication between the electronic device and other devices in a wired or wireless manner. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
While the foregoing describes the embodiments of the present invention, it should be understood that the present invention is not limited to the embodiments, and that various modifications and changes can be made by those skilled in the art without any inventive effort.