Background
With the development of artificial intelligence technology, people flow statistics based on video streams has advanced rapidly and has been applied in many public scenes such as scenic spots, communities and shopping malls. However, existing applications usually focus on tracking and counting people; statistics on how many people pay attention to a screen are lacking, so the attractiveness of the content displayed on the screen to users cannot be measured. Such statistical data is very important for commercial screens.
Content of the application
(I) Technical problem to be solved
The application provides a method for counting the flow of people in front of a screen based on a target detection algorithm, which solves the technical problem that the prior art can only count the flow of people in front of a screen and cannot count the number of people who pay attention to the screen content.
(II) Technical scheme
In order to achieve the above purpose, the present application provides the following technical solutions:
A method for counting the flow of people in front of a screen based on a target detection algorithm comprises the following steps:
step S1: recording a historical information base, in which a plurality of historical head frames are stored;
step S2: receiving a video, wherein the video is divided into a plurality of single-frame images in time order;
step S3: performing feature extraction on each single-frame image through a target detection neural network model to obtain a plurality of target frames containing category information and position information;
step S4: filtering the target frames according to the category information and the position information to obtain a plurality of target face frames and a plurality of target head frames;
step S5: comparing each target head frame with the plurality of historical head frames one by one, outputting a first matching value for each comparison, and judging whether the first matching value is greater than a first threshold; if the first matching value is greater than the first threshold, the target head frame and the historical head frame are judged to be the same pedestrian and the people flow count is not updated; otherwise they are judged to be different pedestrians and the people flow count is increased by 1; this continues until all target head frames have been compared;
step S6: comparing each target head frame with the plurality of target face frames one by one, outputting a second matching value for each comparison, and judging whether the second matching value is greater than a second threshold; if the second matching value is greater than the second threshold, the target head frame is judged to have paid attention to the screen content and the screen attention count is increased by 1; otherwise the target head frame is judged not to have paid attention to the screen content and the screen attention count is not updated; this continues until all target head frames have been compared;
step S7: updating the historical information base: when the first matching value is greater than the first threshold, replacing the corresponding historical head frame in the historical information base with the target head frame; when the first matching value is smaller than the first threshold, adding the target head frame to the historical information base and marking it as a historical head frame.
Preferably, the target detection neural network model is established based on an SSD target detection algorithm and is trained on real head images and angle-limited face images, where the limited angle range is a horizontal rotation angle of -45° to +45°.
Preferably, the category information includes a head image, a face image and a background image, and the position information is the relative coordinates [x0, y0, x1, y1] of the target frame in the single-frame image.
Preferably, step S4 includes:
step S41: filtering far targets: calculating the width-height product of each target frame from its relative coordinates, and filtering out target frames whose width-height product is smaller than 0.03;
step S42: filtering static target frames: obtaining the center coordinate offsets dx and dy and the width and height offsets dw and dh between target frames from their relative coordinates; a target frame whose dx, dy, dw and dh are all smaller than 0.02 is judged to be static in the single-frame image; the number of single-frame images in which this holds is counted, and when the accumulated static time of a target frame exceeds 1 minute, the object corresponding to the target frame is judged to be a static target and filtered out;
step S43: obtaining target head frames and target face frames, according to the category information, from the target frames remaining after the filtering of steps S41 and S42.
Preferably, the video is a real-time video or a historical video, and each single-frame image is a video picture captured every 0.2 seconds from the video.
Preferably, the first matching value is the intersection-over-union (IoU) of the target head frame and the historical head frame, with a value range of 0 to 1, and the first threshold is 0; when the first matching value is greater than the first threshold 0, the pedestrians corresponding to the target head frame and the historical head frame are judged to be the same pedestrian; otherwise they are different pedestrians, and the people flow count is increased by 1.
Preferably, step S5 further includes a verification of the people flow count: counting lines are arranged on both sides of the edge of the picture to detect pedestrians entering or leaving the picture and to record the number of people entering and the number of people leaving; for a target head frame and a historical head frame judged to be the same pedestrian, the direction of motion of the corresponding pedestrian is judged from the positions of the two frames relative to the two counting lines, and entering and leaving events are recorded only once for the same tracked pedestrian;
the people flow count, the number of people entering and the number of people leaving over a period of time are then combined to obtain the verified people flow count: verified people flow count = [(number entering + number leaving)/2 + people flow count]/2.
Preferably, the second matching value includes a similarity value and a match count, where the similarity value is the intersection-over-union of the target head frame and the target face frame, with a value range of 0 to 1, and the match count is the number of times the target head frame and the target face frame have matched successfully; the second threshold includes a similarity threshold of 0.3 and a match-count threshold of 15, that is, when the similarity value is greater than the similarity threshold, the target head frame and the target face frame are judged to belong to the same pedestrian, and when the match count exceeds the match-count threshold, the screen attention count is increased by 1.
Preferably, updating the historical information base further includes counting the number of times a historical head frame fails to match; when a historical head frame has failed to match more than 5 times, it is deleted.
(III) Advantageous effects
Compared with the prior art, the beneficial effects of this application are as follows:
The application provides a method for counting the flow of people in front of a screen based on a target detection algorithm. A target detection neural network model trained on head image data and angle-limited face image data outputs target head frames for the pedestrians passing in front of the screen over a period of time, and these target head frames are tracked and synchronized with historical head frame information to complete the people flow count in front of the screen for that period. By matching target head frames against angle-limited target face frames, the method also counts the number of pedestrians who pass the screen and pay attention to the screen content within the period, that is, the screen attention count.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to figs. 1 and 4, the embodiment of the present application discloses a method for counting the flow of people in front of a screen based on a target detection algorithm. The method is mainly intended for advertising screens in shopping malls and scenic spots with heavy foot traffic, and is used for counting the people flow and the screen attention in front of the screen within a period of time. It comprises the following steps:
Step S1: recording a historical information base, in which a plurality of historical head frames are stored;
Step S2: receiving a video, wherein the video is divided into a plurality of single-frame images in time order. Specifically, the video is a real-time video or a historical video, and each single-frame image is a video picture captured every 0.2 seconds from the video; in this embodiment, the video can be acquired by a camera mounted on the screen.
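As a minimal sketch of this 0.2-second sampling, assuming OpenCV is used for capture (the application does not name a library; the function name and fallback frame rate are illustrative):

```python
import cv2

def sample_frames(source, interval_s=0.2):
    """Yield one frame every interval_s seconds from a video file or live camera."""
    cap = cv2.VideoCapture(source)           # file path for historical video, device index for a live camera
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # assumed fallback when the FPS is unreported
    step = max(1, round(fps * interval_s))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield frame                       # one "single-frame image"
        index += 1
    cap.release()
```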
Step S3: performing feature extraction on the single-frame image through a target detection neural network model to obtain a plurality of target frames containing category information and position information;
Specifically, the target detection neural network model is established based on an SSD target detection algorithm and is trained on real head images and angle-limited face images, where the limited angle range is a horizontal rotation angle of -45° to +45°; this is what later allows face images looking at the screen to be screened out, so that the number of people paying attention to the screen can be counted. Referring to fig. 2, an image first passes through 15 directly connected convolutional layers that extract low-level visual features for small-scale targets, then through 6 directly connected convolutional layers that extract mid-level visual features for medium-scale targets, and then through a further 6 directly connected convolutional layers that extract high-level visual features for large-scale targets. The visual features of the three levels are each regressed through two independent two-layer convolutional networks to obtain the category information and position information of the target frames. Because the model can output a large number of overlapping target frames, a non-maximum suppression algorithm is applied to screen out highly overlapping frames: only the target frames with high confidence are retained, and overlapping target frames with low confidence are removed. The target frames with category information and position information output by the model are therefore filtered by non-maximum suppression before the final target frames are output.
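For illustration only (the application does not give an implementation), the following Python sketch shows greedy non-maximum suppression of the kind described, assuming boxes given as relative coordinates [x0, y0, x1, y1] with confidence scores; the 0.5 overlap threshold and all names are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes [x0, y0, x1, y1]."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, overlap_threshold=0.5):
    """Greedily keep the highest-confidence box, dropping overlapping lower-confidence ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= overlap_threshold]
    return keep  # indices of the retained target frames
```

The same iou() helper also computes the intersection-over-union used as the first matching value and the similarity value in steps S5 and S6 below.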
The category information comprises a head image, a face image and a background image, and the position information is the relative coordinates [x0, y0, x1, y1] of the target frame in the single-frame image. A face image here is a face with a limited angle of -45° to +45°: a face image at 0° corresponds to a pedestrian standing in front of the screen and looking at it directly, a face image between 0° and -45° corresponds to a pedestrian turning left to look at the screen while walking, and a face image between 0° and +45° corresponds to a pedestrian turning right to look at the screen while walking.
Step S4: filtering the target frames according to the category information and the position information to obtain a plurality of target face frames and a plurality of target head frames;
referring to fig. 3, step S4 includes:
Step S41: filtering far targets: the width-height product of each target frame is calculated from its relative coordinates, and target frames whose width-height product is smaller than 0.03 are filtered out. Specifically, the people flow in front of the screen mainly concerns pedestrians near enough to see the screen content, so pedestrians far from the screen are filtered out first and only pedestrians within a certain distance of the screen are counted. The average sizes of head frames and face frames at different distances are obtained by field measurement; during counting, the target frames produced by the target detection neural network model are traversed and the undersized frames are removed. In this embodiment, the screen's camera was tested and the width-height product of a target frame about 3 meters from the screen was found to be about 0.03; target frames farther than 3 meters are therefore not counted, which is why target frames with a width-height product smaller than 0.03 are filtered out.
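A minimal sketch of this far-target filter (the 0.03 threshold is the embodiment's measured value; the function name and box format are illustrative):

```python
MIN_AREA = 0.03  # relative width * height at roughly 3 m from the screen (measured in the embodiment)

def filter_far_targets(boxes):
    """Drop target frames whose relative width-height product is below MIN_AREA."""
    return [(x0, y0, x1, y1) for x0, y0, x1, y1 in boxes
            if (x1 - x0) * (y1 - y0) >= MIN_AREA]
```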
Step S42: filtering static target frames: the center coordinate offsets dx and dy and the width and height offsets dw and dh between target frames are obtained from their relative coordinates; a target frame whose dx, dy, dw and dh are all smaller than 0.02 is judged to be static in the single-frame image; the number of single-frame images in which this holds is counted, and when the accumulated static time of a target frame exceeds 1 minute, the object corresponding to the target frame is judged to be a static target and filtered out. Specifically, because the environment in a shopping mall is complex, a billboard with a portrait may be present in the background of the picture, so static-target filtering is added. The position and size of a target frame in the current single-frame image are compared with those of a historical head frame in the historical information base; when the differences in position and size, that is, the center offsets dx, dy and the size offsets dw, dh, are all smaller than the threshold 0.02, the target in the current frame is considered static. The number of single-frame images in which the target frame remains static is counted, and when the static time or count exceeds a threshold, that is, the accumulated static time exceeds 1 minute or the target frame appears at the same position more than 300 times, the object corresponding to the target frame at that position is judged to be a static background target, and subsequently detected target frames at that position are filtered out.
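A sketch of this static-target filter under the same box format, with the embodiment's thresholds (offsets below 0.02; 300 frames at one sampled frame per 0.2 s is about 1 minute); the per-track counting dictionary and names are assumptions:

```python
OFFSET_THRESHOLD = 0.02   # max center/size offset to still count as static
STATIC_FRAME_LIMIT = 300  # 1 minute at one sampled frame every 0.2 seconds

def center_size(box):
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0

def is_static_pair(prev, curr, threshold=OFFSET_THRESHOLD):
    """True when dx, dy, dw and dh between the two boxes are all below the threshold."""
    pcx, pcy, pw, ph = center_size(prev)
    ccx, ccy, cw, ch = center_size(curr)
    return max(abs(ccx - pcx), abs(ccy - pcy),
               abs(cw - pw), abs(ch - ph)) < threshold

def update_static(static_counts, track_id, prev, curr):
    """Accumulate per-track static frames; True once the track should be filtered."""
    if is_static_pair(prev, curr):
        static_counts[track_id] = static_counts.get(track_id, 0) + 1
    else:
        static_counts[track_id] = 0
    return static_counts[track_id] > STATIC_FRAME_LIMIT
```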
Step S43: obtaining target head frames and target face frames, according to the category information, from the target frames remaining after the filtering of steps S41 and S42.
Step S5: each target head frame is compared with the plurality of historical head frames one by one, a first matching value is output for each comparison, and whether the first matching value is greater than the first threshold is judged; if so, the target head frame and the historical head frame are judged to be the same pedestrian and the people flow count is not updated; otherwise they are judged to be different pedestrians and the people flow count is increased by 1; this continues until all target head frames have been compared. Specifically, the first matching value is the intersection-over-union (IoU) of the target head frame and the historical head frame, with a value range of 0 to 1, and the first threshold is 0; when the first matching value is greater than the first threshold 0, the pedestrians corresponding to the target head frame and the historical head frame are judged to be the same pedestrian; otherwise they are different pedestrians and the people flow count is increased by 1. The target head frames of the current frame are traversed and compared against the historical head frames in the historical information base, and whether two frames belong to the same pedestrian is judged by calculating the IoU between them.
If a target head frame of the current frame fails to match any historical head frame in the historical information base, a different pedestrian is judged and the people flow count is increased by 1;
if a target head frame of the current frame matches a historical head frame in the historical information base successfully, the same pedestrian is judged. The people flow count is then further verified through the position changes in the picture, at different times, of the target frames judged to be the same pedestrian, giving the verified people flow count: counting lines are arranged on both sides of the edge of the picture to detect pedestrians entering or leaving the picture and to record the number of people entering and the number of people leaving; for a target head frame and a historical head frame judged to be the same pedestrian, the direction of motion of the corresponding pedestrian is judged from the positions of the two frames relative to the two counting lines, and entering and leaving events are recorded only once for the same tracked pedestrian;
the people flow count, the number of people entering and the number of people leaving over a period of time are then combined to obtain the verified people flow count: verified people flow count = [(number entering + number leaving)/2 + people flow count]/2. The final verified people flow count is the number of people passing in front of the screen within a period of time; the period can be adjusted to the actual usage scene and is typically set to 1 day.
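The following sketch combines the IoU matching of step S5 with the verification formula; the history dictionary, the track-id counter and all names are assumptions, and the iou() helper repeats the one from the NMS sketch so the block runs on its own:

```python
import itertools

_track_ids = itertools.count()  # assumed id generator for newly seen pedestrians

def iou(a, b):
    """Intersection-over-union (same helper as in the NMS sketch)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_heads(target_head_frames, history, flow_count, first_threshold=0.0):
    """Step S5: match current head frames against history; count new pedestrians."""
    for frame in target_head_frames:
        best_id, best_value = None, 0.0
        for track_id, hist_frame in history.items():
            value = iou(frame, hist_frame)        # first matching value
            if value > best_value:
                best_id, best_value = track_id, value
        if best_value > first_threshold:          # same pedestrian
            history[best_id] = frame              # step S7: replace the matched history frame
        else:                                     # different pedestrian
            flow_count += 1
            history[next(_track_ids)] = frame     # step S7: add as a historical head frame
    return flow_count

def verified_flow_count(flow_count, entered, left):
    """Verified people flow count = [(entered + left) / 2 + flow_count] / 2."""
    return ((entered + left) / 2 + flow_count) / 2
```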
Because the face angle is limited in the target detection neural network model, a pedestrian corresponding to a detected face frame can be considered to be facing the screen and able to see the content displayed on it. The number of pedestrians paying attention to the screen, that is, the screen attention count, is therefore obtained by matching target head frames with target face frames, as in step S6:
Step S6: each target head frame is compared with the plurality of target face frames one by one, a second matching value is output for each comparison, and whether the second matching value is greater than the second threshold is judged; if so, the target head frame is judged to have paid attention to the screen content and the screen attention count is increased by 1; otherwise the target head frame is judged not to have paid attention to the screen content and the screen attention count is not updated; this continues until all target head frames have been compared.
Specifically, the second matching value includes a similarity value and a match count. The similarity value is the intersection-over-union of the target head frame and the target face frame, with a value range of 0 to 1, and the match count is the number of times the target head frame and the target face frame have matched successfully. The second threshold includes a similarity threshold of 0.3 and a match-count threshold of 15: when the similarity value is greater than the similarity threshold, the target head frame and the target face frame are judged to belong to the same pedestrian, and when the match count exceeds the match-count threshold, the screen attention count is increased by 1. Concretely, the target head frames and target face frames of the current single-frame image are traversed and compared, and the IoU between each pair, that is, the similarity value, is calculated; when the similarity value is greater than the similarity threshold 0.3, the head frame and face frame are considered to belong to the same pedestrian and the number of times they match is recorded; when the match count exceeds the match-count threshold 15, the pedestrian is judged to have paid attention to the screen, and the screen attention count is increased by 1.
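A sketch of the attention counting in step S6 with the embodiment's thresholds (similarity 0.3, match count 15); the per-track match_counts dictionary and the once-per-pedestrian increment are assumptions about bookkeeping the application leaves implicit:

```python
SIMILARITY_THRESHOLD = 0.3   # second threshold, part 1: head/face IoU
MATCH_COUNT_THRESHOLD = 15   # second threshold, part 2: successful matches

def iou(a, b):
    """Intersection-over-union (same helper as in the NMS sketch)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def update_attention(track_id, head_frame, face_frames, match_counts, attention_count):
    """Step S6: accumulate head/face matches; count the pedestrian once past the threshold."""
    for face_frame in face_frames:
        if iou(head_frame, face_frame) > SIMILARITY_THRESHOLD:  # similarity value: same pedestrian
            match_counts[track_id] = match_counts.get(track_id, 0) + 1
            if match_counts[track_id] == MATCH_COUNT_THRESHOLD + 1:
                attention_count += 1   # count this pedestrian exactly once, on first exceeding
            break
    return attention_count
```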
Because the camera is mounted on the screen, it can only capture pictures at a horizontal angle, and in crowded settings single-frame images with overlapping pedestrians are hard to avoid; the historical head frames recorded in the historical information base therefore cannot simply be replaced wholesale with the target frame information of the current single-frame image. Instead, the historical information base is updated with the strategy of step S7, so that when a pedestrian briefly disappears from the picture because of occlusion and then reappears, the corresponding historical head frame in the historical information base is not lost.
Step S7: updating the historical information base: when the first matching value is greater than the first threshold, the corresponding historical head frame in the historical information base is replaced with the target head frame; when the first matching value is smaller than the first threshold, the target head frame is added to the historical information base and marked as a historical head frame. Further, updating the historical information base also includes counting the number of times a historical head frame fails to match; when a historical head frame has failed to match more than 5 times, it is deleted.
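A sketch of the history maintenance in step S7, including the deletion rule of the preferred embodiment (more than 5 consecutive failed matches); the HistoryEntry structure and id scheme are assumptions:

```python
import itertools
from dataclasses import dataclass

MAX_MISSES = 5              # preferred embodiment: delete after more than 5 failed matches
_new_ids = itertools.count()

@dataclass
class HistoryEntry:
    box: tuple              # head frame [x0, y0, x1, y1]
    misses: int = 0         # consecutive failed matches

def update_history(history, matched, unmatched_frames):
    """history: id -> HistoryEntry; matched: id -> new box for frames matched this round."""
    for track_id, entry in list(history.items()):
        if track_id in matched:
            entry.box = matched[track_id]   # replace with the current target head frame
            entry.misses = 0
        else:
            entry.misses += 1
            if entry.misses > MAX_MISSES:   # pedestrian has left: drop the stale entry
                del history[track_id]
    for box in unmatched_frames:            # new pedestrians become historical head frames
        history[next(_new_ids)] = HistoryEntry(box)
```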
It should be noted that although embodiments of the present application have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the application, the scope of which is defined by the appended claims and their equivalents.