Disclosure of Invention
To address the problems in the prior art, the invention provides a multi-person sports event counting method, system and medium based on multi-keypoint detection.
To solve the above technical problems, the invention adopts the following technical solution:
A multi-person sports event counting method based on multi-keypoint detection comprises the following implementation steps:
1) Video stream data of a venue is acquired.
2) Video frames in the video stream are preprocessed to obtain the video frames finally used for keypoint detection; the preprocessing comprises sampling the video frames, pre-storing (buffering) the video frames and, where necessary, scaling the video frames.
3) The bounding-box position of each person to be detected in the video frame is detected, and the image region used for keypoint detection is cropped. A target detection model based on YOLO V8 is used to detect the bounding-box positions; the model has a high inference speed and can meet the requirement of real-time detection. The region of each identified person is cropped for subsequent keypoint detection and is also required by the subsequent target tracking task. After target detection is completed, the persons to be detected are tracked using the DeepSort algorithm, which assigns each person a unique and unchanging ID to facilitate the subsequent counting operation.
4) The coordinate values of 33 keypoints of the human body are detected using the mediapipe pose detection module: 11 on the head, 10 on the arms (both sides), 4 on the trunk, 2 on the legs (both sides) and 6 on the feet (both sides).
5) Counting is completed for different sports events. Taking sit-up counting as an example, four states of the person to be detected are defined, namely a ready state, a supine state, a sit-up state and an end state; the transitions between the states are judged using corresponding rules, and one sit-up is completed when the person to be detected completes the full state transition from the supine state to the sit-up state and back to the supine state.
A multi-person sports event counting system based on multi-keypoint detection comprises a computer device, wherein the computer device is programmed or configured to perform the steps of the multi-person sports event counting method based on multi-keypoint detection according to any one of the above, or a storage medium of the computer device stores a computer program programmed or configured to perform that method.
A computer-readable storage medium stores a computer program programmed or configured to perform the multi-person sports event counting method based on multi-keypoint detection according to any one of the above.
Compared with the prior art, the invention has the following advantages. The video stream is preprocessed, which prevents video stream buffer overflow when the processing speed is insufficient. Multi-person, multi-keypoint detection is realized, which improves the accuracy and precision of the counting algorithm while maintaining keypoint detection efficiency. Because DeepSort is used for target tracking, tracking failures caused by momentary occlusion can be effectively avoided, so that sports counting for multiple persons can be completed accurately under different backgrounds, illumination intensities and numbers of people.
Detailed Description
As shown in fig. 1, the implementation steps of the multi-person sports event counting method based on multi-keypoint detection in this embodiment include:
1. Video stream data of a venue is acquired, typically from an image capturing apparatus that supports video streaming over a wired or wireless connection, for example one supporting RTSP, RTMP or other encoded video streams. The method also supports local camera devices, and can perform the corresponding counting (timing) on stored video to be detected.
2. Video frames in the video stream are preprocessed to obtain the video frames finally used for keypoint detection. The detailed steps are as follows:
2.1 Video frames in the original video stream are uniformly sampled; for video streams above 30 FPS, the frame rate after sampling is 20 frames per second.
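As an illustration of the uniform sampling in 2.1, the following minimal sketch selects which frame indices to keep when a stream is reduced to a lower frame rate. The function name `sample_indices` and its parameters are hypothetical and are not taken from the original; the sketch only assumes that frames arrive at a constant source rate.

```python
def sample_indices(src_fps, dst_fps, n_frames):
    """Return the indices of the frames kept when a stream recorded at
    src_fps is uniformly downsampled to dst_fps (hypothetical helper)."""
    if src_fps <= dst_fps:
        return list(range(n_frames))  # stream is already slow enough
    kept = []
    emitted = 0  # number of output frames emitted so far
    for i in range(n_frames):
        # Frame i has timestamp i / src_fps; output frame `emitted` is due at
        # emitted / dst_fps. Cross-multiplying avoids floating-point division.
        if i * dst_fps >= emitted * src_fps:
            kept.append(i)
            emitted += 1
    return kept
```

For a 60 FPS stream and a 20 FPS target, this keeps every third frame, so one second of input yields 20 frames.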
2.2 A data structure called FramePool is created, whose size is determined by the hardware conditions of the specific computing device, such as its memory. The video frames sampled in 2.1) are first stored in FramePool. When FramePool is full and a new video frame arrives, the frame that entered FramePool earliest is deleted first and the newly arrived frame is stored. Subsequent processing likewise takes the earliest-entered frame from FramePool. These operations are all asynchronous and are protected by a locking mechanism to ensure the correctness of FramePool access.
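The behaviour described for FramePool (bounded capacity, oldest frame dropped when full, first-in-first-out consumption, locked access) can be sketched as follows. Only the name FramePool comes from the text; the method names `put` and `get`, the `timeout` parameter and the use of `collections.deque` with `threading.Condition` are assumptions made for this illustration.

```python
import threading
from collections import deque


class FramePool:
    """Bounded FIFO frame buffer: when full, the oldest frame is discarded
    so that the newest frame can always be stored."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen drops items from the left when appending to a full deque.
        self._frames = deque(maxlen=capacity)
        self._cond = threading.Condition()

    def put(self, frame):
        """Called by the capture thread for each sampled frame."""
        with self._cond:
            self._frames.append(frame)
            self._cond.notify()

    def get(self, timeout=None):
        """Called by the processing thread; returns the oldest frame,
        or None if no frame arrives within `timeout` seconds."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._frames) > 0, timeout=timeout):
                return None
            return self._frames.popleft()

    def __len__(self):
        with self._cond:
            return len(self._frames)
```

With this design the producer never blocks, so a slow consumer loses old frames instead of causing the stream buffer to grow without bound.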
2.3 For video frames taken from FramePool, if the resolution is higher than 1280 x 720, the frame is scaled to 1280 x 720 to facilitate subsequent processing. This preprocessing effectively avoids video stream buffer overflow caused by insufficient computing power of the computing device.
3. The bounding-box position of each person to be detected in the video frame is detected, the image region used for keypoint detection is cropped, and target tracking is performed on the persons to be detected. The specific implementation steps are as follows:
3.1 A target detection model based on YOLO V8 is used to detect the persons to be detected in the video frame. The Backbone of the model adopts the C2f module as its basic structural unit and specifically comprises 5 convolution modules and 4 C2f modules; compared with the C3 module of the previous generation, it has fewer parameters and better feature extraction capability, and it meets the requirement of real-time detection on video stream data. The Neck of the model adopts multi-scale feature fusion to fuse feature maps from different stages of the Backbone and enhance the feature representation capability; this part plays an important role in feature extraction and feature fusion. The Head is responsible for the final target detection and classification tasks and comprises a detection head and a classification head. The detection head comprises a series of convolution layers and deconvolution layers for generating detection results, and the classification head adopts global average pooling to classify each feature map. According to the obtained bounding box information, the region to be used for keypoint detection, namely the region containing the person to be detected, is cropped from the image.
3.2 According to the bounding box information obtained in 3.1) and the cropped image regions, target tracking is performed on the persons to be detected in the video frame using the DeepSort algorithm. The specific implementation steps are shown in fig. 2. Features are extracted from the images cropped in 3.1) and are subsequently used for cascade matching. The structure of the feature extraction network is shown in fig. 3. The front convolution layer consists of a 2D convolution and a BN layer; after ReLU and max pooling, its output is processed by convolution layers composed of basic blocks. Each basic block consists of 2D convolution layers each followed by BN, with ReLU as the activation function; in particular, a basic block provides an optional downsampling operation. The backbone network consists of 8 basic blocks and an average pooling layer. Every two basic blocks form a group, and if a group is selected for downsampling, only the first basic block in the group is downsampled; in the current structure, the 3rd, 5th and 7th basic blocks are downsampled. The classifier consists of a linear layer and a BN layer, followed, after ReLU activation and Dropout, by a final classification linear layer that produces the result. The extracted features are used for the subsequent cascade matching.
The tracks of the bounding boxes have two states, namely a confirmed state and an unconfirmed state. An unconfirmed track can be converted into a confirmed track under certain conditions, and unconfirmed and confirmed tracks are handled by different rules in the subsequent matching process. Each existing track predicts its subsequent position through Kalman filtering (hereinafter KF).
When a person bounding box is detected for the first time, it is initialized as a new track; all tracks are in the unconfirmed state when generated. For a track in the unconfirmed state, IOU matching is performed between its KF-predicted position and the detection boxes of the subsequent video frame, and the Hungarian algorithm produces three kinds of matching results: if a track is not matched, the track is deleted (it can be deleted directly because it is in the unconfirmed state); if a detection box is not matched, the detection box is initialized as a new track; and if a track and a detection box are matched, the track is updated according to the position of the detection box.
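The three matching outcomes can be illustrated with the sketch below. It computes IOU between predicted track boxes and detection boxes and finds the best one-to-one assignment by exhaustive search, which is used here only as a simple stand-in for the Hungarian algorithm (both find an optimal assignment, but exhaustive search is practical only for a handful of people). The function names, the `(x1, y1, x2, y2)` box format and the `min_iou` value of 0.3 are assumptions, not values from the original.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(tracks, detections, min_iou=0.3):
    """Return (matches, unmatched_tracks, unmatched_detections) as index lists.
    min_iou is an assumed gating threshold."""
    n_t, n_d = len(tracks), len(detections)
    k = min(n_t, n_d)
    # Enumerate every one-to-one assignment of the smaller set into the larger one.
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p)) for p in permutations(range(n_d), k))
    else:
        candidates = (list(zip(p, range(n_d))) for p in permutations(range(n_t), k))
    best_pairs, best_score = [], -1.0
    for pairs in candidates:
        score = sum(iou(tracks[t], detections[d]) for t, d in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    # Pairs with too little overlap are treated as unmatched on both sides.
    matches = [(t, d) for t, d in best_pairs if iou(tracks[t], detections[d]) >= min_iou]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    return (matches,
            [t for t in range(n_t) if t not in matched_t],
            [d for d in range(n_d) if d not in matched_d])
```

In this sketch an unmatched detection index would be turned into a new track, and an unmatched unconfirmed track would be deleted.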
If a track matches a detected bounding box a number of times in succession, the track transitions to the confirmed state; this number is a configurable threshold, typically set to 3. For a confirmed track, cascade matching is performed between its KF-predicted position and the detection boxes in the subsequent video frame, using the features extracted by the deep neural network. The cascade matching has two outcomes: detection boxes or tracks that are not matched proceed to the subsequent IOU matching, and if the matching succeeds, the track information is updated according to the position of the detection box. For a confirmed track, if the IOU matching fails, the track is not deleted directly but is deleted only after a number of consecutive failures; this number is a configurable threshold, typically set to 30.
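The track life cycle described above (unconfirmed, confirmed, deleted) can be sketched as a small state holder. The thresholds 3 and 30 come from the text; the class name `Track`, the state labels and the choice not to count the initial detection as a match are assumptions made for illustration.

```python
from dataclasses import dataclass

N_INIT = 3    # consecutive matches needed to confirm a track (value from the text)
MAX_AGE = 30  # consecutive misses after which a confirmed track is deleted (value from the text)


@dataclass
class Track:
    track_id: int
    hits: int = 0             # consecutive successful matches since creation
    misses: int = 0           # consecutive failed matches
    state: str = "tentative"  # "tentative" (unconfirmed), "confirmed" or "deleted"

    def mark_matched(self):
        if self.state == "deleted":
            return
        self.hits += 1
        self.misses = 0  # a successful match resets the miss counter
        if self.state == "tentative" and self.hits >= N_INIT:
            self.state = "confirmed"

    def mark_missed(self):
        if self.state == "deleted":
            return
        if self.state == "tentative":
            self.state = "deleted"  # unconfirmed tracks are dropped on the first miss
            return
        self.misses += 1
        if self.misses >= MAX_AGE:
            self.state = "deleted"
```

Keeping a confirmed track alive through up to 29 consecutive misses is what lets a briefly occluded person keep the same ID.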
4. The coordinate values of 33 keypoints of the human body are detected (see fig. 4): 11 on the head, 10 on the arms (both sides), 4 on the trunk, 2 on the legs (both sides) and 6 on the feet (both sides). The keypoint indices are: 0-nose, 1-left eye (inner), 2-left eye, 3-left eye (outer), 4-right eye (inner), 5-right eye, 6-right eye (outer), 7-left ear, 8-right ear, 9-mouth (left), 10-mouth (right), 11-left shoulder, 12-right shoulder, 13-left elbow, 14-right elbow, 15-left wrist, 16-right wrist, 17-left pinky, 18-right pinky, 19-left index, 20-right index, 21-left thumb, 22-right thumb, 23-left hip, 24-right hip, 25-left knee, 26-right knee, 27-left ankle, 28-right ankle, 29-left heel, 30-right heel, 31-left foot index, 32-right foot index. The keypoint detection model uses the Pose module of mediapipe. Compared with mainstream models such as YOLO-POSE, this module identifies more keypoints and runs faster, and because it subdivides the keypoints of the foot it has a particular advantage for events that require fine judgment of the feet. Its drawback is that, because it uses a bottom-up model architecture, that is, it identifies all possible human keypoints in the image and then selects the keypoints with the highest confidence to form a human skeleton, the model only supports keypoint detection for a single person: when several persons to be detected are present in the image, it outputs the keypoints of only the one 'most credible' person. To address this, the method combines the model with target detection and target tracking, achieving fast multi-person skeletal keypoint identification with many keypoints.
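Because the pose model is run on each cropped person region, its output has to be mapped back into full-frame coordinates before keypoints from different persons can be compared with fixed positions such as the standard line. The sketch below shows one way to do this, assuming the landmarks are normalized to the range [0, 1] relative to the crop, as mediapipe pose landmarks are relative to their input image. The grouping of indices into body regions is inferred from the counts given in the text (in particular, the 4 trunk keypoints are taken to be the shoulders and hips), and the names `POSE_GROUPS` and `crop_to_frame` are hypothetical.

```python
# Keypoint indices grouped by body region; the grouping is an inference from
# the counts 11 / 10 / 4 / 2 / 6 stated in the text.
POSE_GROUPS = {
    "head": tuple(range(0, 11)),    # nose, eyes, ears, mouth
    "trunk": (11, 12, 23, 24),      # shoulders and hips
    "arms": tuple(range(13, 23)),   # elbows, wrists, pinky, index, thumb
    "legs": (25, 26),               # knees
    "feet": tuple(range(27, 33)),   # ankles, heels, foot index (toes)
}


def crop_to_frame(norm_x, norm_y, box):
    """Map a landmark normalized to a crop ([0, 1] on each axis) to
    full-frame pixel coordinates. `box` is the crop's bounding box
    (x1, y1, x2, y2) in frame pixels."""
    x1, y1, x2, y2 = box
    return x1 + norm_x * (x2 - x1), y1 + norm_y * (y2 - y1)
```

For example, a landmark at the centre of a crop taken from the box (100, 200, 300, 600) maps to the frame position (200.0, 400.0).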
5. For different sports events, counting is completed by the corresponding counting algorithm. Taking sit-up counting as an example (see fig. 5), the specific steps of this embodiment are as follows:
5.1 The states of the person to be detected are defined as a ready state, a supine state, a sit-up state and an end state, as follows. Ready state: the person is in the picture and is detected by the target detection model but has not yet started the sit-up. Supine state: the person lies in the test position with the trunk horizontal, the scapulae touching the ground, the legs naturally bent, the hands placed at both sides of the body close to the ground, and the soles touching the ground. Sit-up state: the person is detected to have raised the upper body to a certain angle by abdominal strength while the fingers move forward to the standard line position; the feet and buttocks must not leave the ground during this process. End state: the person enters the end state after standing up from the supine state or the sit-up state, and no counting judgment is performed after the end state is entered.
5.2 After a person to be detected appears in the video picture and is detected by the target detection model, the tracking algorithm assigns the person a unique ID and the person enters the ready state; at this point the skeletal keypoints of the person are identified.
5.3 The detected keypoints are screened by confidence before subsequent use. Keypoints whose confidence is below a set threshold are not used; for keypoints that exist on both sides of the body, if the confidence on both sides exceeds the set threshold, the average of the two sides is used. This reduces misjudgment caused by errors of the keypoint detection model.
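A minimal sketch of this screening for one left/right keypoint pair follows. The function name `fuse_sides`, the `(x, y, confidence)` tuple layout and the default threshold of 0.5 are assumptions; the text does not state the threshold, and using the single remaining side when only one side passes is likewise an interpretation of the text.

```python
def fuse_sides(left, right, threshold=0.5):
    """Combine the left and right instances of a keypoint.

    Each argument is an (x, y, confidence) tuple. Sides below `threshold`
    are discarded; if both sides pass, their average is returned; if one
    side passes, that side is returned (assumption); otherwise None.
    """
    valid = [p for p in (left, right) if p[2] >= threshold]
    if not valid:
        return None
    n = len(valid)
    return (sum(p[0] for p in valid) / n, sum(p[1] for p in valid) / n)
```

Returning None lets the caller skip the frame for that keypoint instead of judging a state from unreliable coordinates.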
5.4 Supine state judgment. To determine that the person to be detected is lying horizontally rather than upright, the difference between the x coordinates of the head keypoint and the foot keypoint must be greater than the difference between their y coordinates, and the difference between the x coordinates of the hip keypoint and the foot keypoint must be greater than the difference between their y coordinates. The head keypoint uses the coordinates of the nose keypoint, and the foot keypoint uses the average of the coordinates of the three foot keypoints, namely ankle, heel and toe. In addition, the person must lie flat on the ground with the scapulae close to the ground, that is, the angle formed by the head, hip and foot keypoints is required to be close to 180 degrees; the legs must be naturally bent, that is, the angle formed by the hip, knee and foot keypoints is required to be close to 90 degrees; and the feet must touch the ground, that is, while the lying requirement is met, the angle formed by the toe, heel and hip keypoints is required to be close to 180 degrees. Each angle is calculated from the coordinates of three keypoints (x1, y1), (x2, y2) and (x3, y3), where (x2, y2) are the coordinates of the keypoint at the vertex of the angle.
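One common way to compute the angle at a vertex from three points is the arccosine of the normalized dot product of the two vectors leaving the vertex. The sketch below uses that formulation as an assumption; it is not necessarily the exact formula of the original. The helper `is_lying` illustrates the horizontal-versus-upright comparison of x and y differences, and both function names are hypothetical.

```python
import math


def angle_deg(p1, p2, p3):
    """Angle in degrees (0 to 180) at vertex p2 formed by p1-p2-p3,
    where each point is an (x, y) pair."""
    v1 = (p1[0] - p2[0], p1[1] - p2[1])
    v2 = (p3[0] - p2[0], p3[1] - p2[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("vertex coincides with an endpoint")
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def is_lying(head, hip, foot):
    """True if the horizontal extent exceeds the vertical extent for both
    the head-foot and the hip-foot pairs."""
    return (abs(head[0] - foot[0]) > abs(head[1] - foot[1])
            and abs(hip[0] - foot[0]) > abs(hip[1] - foot[1]))
```

A tolerance around 180 or 90 degrees would still have to be chosen when these angles are compared with the supine criteria; the text does not give one.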
5.5 Sit-up state judgment. After the person to be detected has entered the supine state, it is judged whether the person enters the sit-up state. First, it is detected whether fouls such as lifting the hips or lifting the feet occur during the sit-up; if so, the repetition is not counted. The hip and foot keypoint coordinates are recorded when the person enters the supine state, and whether such a foul occurs is judged from the change in the y coordinates of the hips and feet during the sit-up. Foot lifting includes lifting the toes and lifting the heels; both can be detected at once simply by comparing the average of the three foot keypoints, namely ankle, heel and toe, with the initially recorded foot keypoint average. If no foul occurs, the rising angle of the person is judged from the angle formed by the shoulder, hip and foot keypoints; the requirement is met when this angle reaches a certain threshold, normally set to 150 degrees, that is, the upper body of the person forms an angle of more than 30 degrees with the ground. Finally, it is detected whether the hands move to the standard line, judged from the finger coordinates and the position of the standard line.
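The three checks of 5.5 can be combined as in the sketch below. The 150 degree threshold comes from the text; everything else is an assumption made for illustration: the function name `sit_up_reached`, the pixel tolerance `lift_tol` for hip and foot movement, and the `feet_side` flag that encodes whether the feet (and therefore the standard line) lie toward larger or smaller x in the image.

```python
def sit_up_reached(angle, hip_y, foot_y, ref_hip_y, ref_foot_y,
                   finger_x, line_x, feet_side=1,
                   lift_tol=15.0, max_angle=150.0):
    """Return True if the current frame satisfies the sit-up criteria.

    angle       shoulder-hip-foot angle in degrees
    ref_*_y     hip / foot y coordinates recorded on entering the supine state
    feet_side   +1 if the feet are at larger x than the hips, -1 otherwise (assumption)
    lift_tol    allowed vertical drift in pixels before a lift is flagged (assumption)
    """
    # Foul check: hips or feet moved vertically beyond the tolerance.
    if abs(hip_y - ref_hip_y) > lift_tol or abs(foot_y - ref_foot_y) > lift_tol:
        return False
    # Rise check: the upper body must be raised more than 30 degrees from the
    # ground, i.e. the shoulder-hip-foot angle must not exceed 150 degrees.
    if angle > max_angle:
        return False
    # Line check: the fingers must have reached the standard line,
    # measured in the direction of the feet.
    return (finger_x - line_x) * feet_side >= 0
```

Here `foot_y` is meant to be the average of the ankle, heel and toe keypoints, so a raised toe or a raised heel both shift it away from the recorded reference.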
5.6 End state judgment. If the current state of the person to be detected is the supine state or the sit-up state and the person no longer satisfies the horizontal lying condition described in 5.4), that is, the person has stood up, the person enters the end state, and no counting judgment is performed after the end state is entered.
5.7 When the person to be detected enters the sit-up state from the supine state and then returns to the supine state, one sit-up is considered complete and the person's sit-up count is incremented. The supine state can be entered from the ready state and the sit-up state, the sit-up state can only be entered from the supine state, and the end state can be entered from the supine state and the sit-up state.
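The state transitions of 5.1 to 5.7 can be summarized as a small per-person state machine. In the sketch below, the three boolean inputs stand for the per-frame judgments described above (horizontal lying, supine criteria met, sit-up criteria met); the class name `SitUpCounter` and the `update` interface are hypothetical. One instance would be kept per tracking ID.

```python
READY, SUPINE, SIT_UP, END = "ready", "supine", "sit_up", "end"


class SitUpCounter:
    """Per-person sit-up counter driven by per-frame judgments."""

    def __init__(self):
        self.state = READY
        self.count = 0

    def update(self, lying, supine_ok, sit_up_ok):
        """Advance the state machine by one frame and return the current count.

        lying      person still satisfies the horizontal lying condition
        supine_ok  all supine-state criteria are met in this frame
        sit_up_ok  all sit-up criteria are met in this frame (no foul)
        """
        if self.state == END:
            return self.count  # no counting judgment after the end state
        if self.state in (SUPINE, SIT_UP) and not lying:
            self.state = END                 # person has stood up
        elif self.state == READY and supine_ok:
            self.state = SUPINE
        elif self.state == SUPINE and sit_up_ok:
            self.state = SIT_UP
        elif self.state == SIT_UP and supine_ok:
            self.state = SUPINE              # supine -> sit-up -> supine completed
            self.count += 1
        return self.count
```

Counting on the return to the supine state, and not on reaching the sit-up state, means an incomplete repetition that ends with the person standing up is not counted.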
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Modifications and equivalent substitutions made to the technical solution by those skilled in the art without departing from its spirit and scope shall all be covered by the claims of the present invention.