
Automatic evaluation method of sit-up action quality based on posture key points

Info

Publication number
CN118968629A
Authority
CN
China
Prior art keywords
key
sit
action
motion
gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411161998.3A
Other languages
Chinese (zh)
Other versions
CN118968629B (en)
Inventor
王汝尧
杨睿
刘国忠
白培瑞
刘庆一
修晓娜
丁浩然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology
Priority to CN202411161998.3A
Publication of CN118968629A
Application granted
Publication of CN118968629B
Legal status: Active (Current)
Anticipated expiration


Abstract


The present invention belongs to the technical field of video action quality assessment, and specifically discloses an automatic sit-up action quality assessment method based on posture key points. The method first integrates the global attention mechanism GAM and the multi-scale attention mechanism EMA into the YOLOv8-Pose network, greatly improving action posture estimation and key point detection: both human posture detection and skeleton key point localization become more reliable, which in turn makes the action quality prediction score of the scoring network more accurate. The scoring network follows a skeleton-key-point approach, extracting the six skeleton key points that best characterize the sit-up action from the video images; finally, exploiting the distinct key-angle characteristics of the action points in each sit-up stage, it applies a weighted scoring method, which helps improve the accuracy of automatic scoring of sit-up completion quality.

Description

Automatic sit-up motion quality assessment method based on gesture key points
Technical Field
The invention belongs to the technical field of video motion quality assessment, and particularly relates to an automatic sit-up motion quality assessment method based on gesture key points.
Background
Video action quality assessment (Action Quality Assessment, AQA) is an automated technique capable of monitoring and assessing human action quality in real time, with significant advantages in the field of sports testing. Currently, there are two main processing strategies for action quality assessment based on video analysis and understanding. The first strategy processes and analyzes the video stream directly, providing a basis for quality assessment by extracting motion features and their spatio-temporal correlations. The second strategy evaluates completion quality from the motion trajectories of skeleton key points. Although this approach requires skeleton features to be extracted first, and may therefore be affected by skeleton-extraction performance, it effectively reduces the computational load, since only key point motion features such as joint angles and velocities need to be tracked; it has therefore attracted attention in the fields of action recognition and action quality assessment.
Sit-ups are a conventional physical examination item covering primary and middle schools and university sports teaching. At present they are mainly checked by sports teachers through visual inspection and counting, so the assessed completion quality depends on the professional experience of the referee: the process is highly subjective, time-consuming and labor-intensive, non-standard actions are easily missed, and reasonable deductions for them are difficult to make. Video-assisted refereeing systems for large sports competitions have achieved some application success. However, in mass-oriented small-scale sports testing, intelligent evaluation and automatic action quality analysis technologies are not yet mature; they suffer from long detection times and insufficient accuracy, and they also lack reasonable scoring standards and deduction bases for non-standard actions. For example:
Patent document 1 provides a sit-up action standardization detection method, apparatus and storage medium for detecting whether a sit-up action is standard. The method collects video images of sit-ups; starting from the i-th frame of the video, it calculates a standardization score for the sit-up actions from the i-th frame to the (i+n1)-th frame; when the score is greater than 0, the action is judged standard, otherwise non-standard. This patent document only judges whether the action is standard: a non-standard action directly scores 0, and no specific score or deduction standard for non-standard actions is provided.
Patent document 2 provides a comprehensive physical fitness evaluation method, system and storable medium based on machine vision. Field environment data of a tester are acquired, processed and judged; when a first environmental condition is met, action video data of the tester are acquired, first key point information and first key point coordinate information are obtained from the video data, and the completion status is determined from this information. When a second environmental condition is met, the action video data are likewise acquired and analyzed frame by frame; corresponding second key point coordinate information is obtained from the second key point information, the tester's action state is determined to be in the pull-up direction, and the tester's action completion message is determined according to a second preset condition and displayed in real time. This patent document can thus detect the completion of both sit-ups and pull-ups.
However, patent document 2 uses a large number of key points, some of which, such as the eyes and eyebrows, are of little value for sit-up evaluation and add to the data processing time. Moreover, it only studies counting and whether the action is standard, and does not develop a detailed action score or scoring standard.
References
Patent document 1: Chinese invention patent application, publication No. CN118015706A, publication date: 2024.05.10.
Patent document 2: Chinese invention patent application, publication No. CN117877120A, publication date: 2024.04.12.
Disclosure of Invention
The invention aims to provide an automatic sit-up action quality assessment method based on posture key points. Building on the YOLOv8-Pose network, it provides an improved key posture estimation and key point detection network, performs action posture estimation and key point detection on key action frames of the video, and applies weighted scoring of sit-up action quality through key angle characteristics, which helps improve the accuracy of automatic scoring of the completion quality of a single complete sit-up.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The sit-up action quality automatic evaluation method based on the gesture key points comprises the following steps:
Step 1, firstly, acquiring a sit-up video data set, and extracting and labeling video key action frames of the acquired video;
Step 2, constructing a key posture estimation and key point detection network based on an improved YOLOv8-Pose model, and feeding the key frames extracted in step 1 into this network for posture estimation and key point detection;
Step 3, finally, sending the key point data obtained by the key posture estimation and key point detection network into a score evaluation network for weighted scoring of key action points, completing the quality assessment of the sit-up action.
The invention has the following advantages:
As described above, the invention relates to an automatic sit-up action quality assessment method based on posture key points. A detailed scoring standard and a weighted scoring scheme are formulated according to the characteristics of the sit-up action, so sit-up quality can be evaluated more accurately, reasonably and effectively than with prior art that merely counts sit-ups. Specifically, the invention first integrates a global attention mechanism (Global Attention Mechanism, GAM) and an efficient multi-scale attention mechanism (Efficient Multi-Scale Attention, EMA) into the YOLOv8-Pose network, greatly improving action posture estimation and key point detection, and effectively improving human posture detection and skeleton key point localization, which in turn makes the action quality prediction score of the scoring network more accurate. The proposed scoring network follows a skeleton-key-point approach and extracts the 6 skeleton key points that best represent the sit-up action in the video images, which suppresses the influence of the background environment and improves the model speed to a certain extent; finally, exploiting the distinct key-angle characteristics of the action points in each sit-up stage, it scores action quality with a weighted scoring method, helping improve the accuracy of automatic scoring of the completion quality of a single complete sit-up. The invention targets the specific test item of sit-ups, is close to actual needs, and can provide a reasonable reference scheme for sit-up teaching and testing.
Drawings
FIG. 1 is a flowchart of a sit-up motion quality automatic assessment method based on gesture key points in an embodiment of the present invention;
FIG. 2 is a schematic diagram of video key frame extraction in an embodiment of the present invention;
FIG. 3 is a diagram of the key posture estimation and key point detection network based on the improved YOLOv8-Pose model in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a GAM module according to an embodiment of the invention;
FIG. 5 is a schematic diagram of an EMA module according to an embodiment of the invention;
FIG. 6 is a diagram of a score evaluation network in accordance with an embodiment of the present invention;
FIG. 7 is a schematic view of 4 standard motion attitudes for a sit-up completion procedure;
Wherein P1 in fig. 7 is a standard posture at the beginning, P2 in fig. 7 is a standard posture of upper body lifting, P3 in fig. 7 is a standard posture of abdomen lifting, and P4 in fig. 7 is a standard posture of sit-up ending;
FIG. 8 is a schematic diagram of bone keypoint capture in accordance with an embodiment of the present invention;
wherein (a) in fig. 8 is a schematic diagram of the 6 skeleton key points and their coordinates, (b) in fig. 8 is a key frame from the collected sit-up video with a red box marking the region of interest, and (c) in fig. 8 shows the 6 skeleton key points captured from that frame.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and detailed description:
This embodiment describes an automatic sit-up action quality assessment method based on posture key points, as shown in fig. 1, comprising the following steps:
Step 1, firstly, a sit-up video dataset is acquired, and video key action frames of the acquired videos are extracted and labeled.
Since no existing public dataset contains sit-up action videos, the invention builds its own sit-up video dataset, defined as SDUST-Situp. The SDUST-Situp dataset comprises 108 videos in 32 groups, each group containing several consecutive standard and non-standard sit-up action videos.
And marking action quality scores of sit-up actions in the video, and then performing frame extraction operation on the video.
The preprocessing operation is specifically as follows:
Firstly, 16 frames are extracted from each video using a uniform frame extraction strategy. Then, frame similarity estimation is performed on the 16 frames of each video using the video similarity learning network ViSiL; redundant frames with high similarity are removed, and finally 4 key frames are retained, namely a start key frame, an upper body lifting key frame, an abdomen lifting key frame and an end key frame, as shown in fig. 2. In this way, the key features of a single sit-up action are preserved while the amount of data to be processed is reduced.
The uniform frame extraction strategy is specifically as follows:
Firstly, the input sit-up video file is read and the total frame count of each video is obtained. With a fixed extraction count of 16 frames, the frame extraction interval is computed as step = total frame count / 16; one image is saved every step frames, so that after the loop each video yields 16 uniformly spaced frames.
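The sampling logic just described can be sketched in a few lines of Python with OpenCV; this is only a minimal sketch of the strategy, and the function name and paths are illustrative, not from the patent.

```python
import cv2

def uniform_sample(video_path: str, num_frames: int = 16) -> list:
    """Uniformly sample num_frames images from a video (interval step = total / num_frames)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)              # frame extraction interval
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * step)  # jump to every step-th frame
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```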
Inter-frame similarity is then computed on the 16 uniformly extracted frames of each video, and redundant frames with high similarity are removed using the ViSiL similarity learning network. The specific process is as follows:
Firstly, a convolutional neural network (CNN) sequentially extracts the spatial information of the 16 frames of each video, including features such as color, texture and shape; after the spatial features are extracted, a recurrent neural network (RNN) further captures the timing information between frames, such as motion and variation features.
Secondly, the spatio-temporal features are fused by a bilinear fusion method, and the cosine similarity between the features of every two frames is calculated.
Finally, to reject redundant frames with high similarity, a similarity threshold is set, for example 0.8. When the similarity between two frames exceeds the threshold, the earlier frame is kept and the later frame is rejected; this is applied sequentially until 4 key frames remain.
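The rejection rule can be illustrated with a simple cosine-similarity filter over per-frame feature vectors. A minimal sketch: the vectors here stand in for the fused ViSiL spatio-temporal features, whose CNN/RNN extraction is not reproduced, and the 0.8 threshold follows the example above.

```python
import numpy as np

def drop_redundant(frames: list, feats: list, threshold: float = 0.8) -> list:
    """Keep a frame only if it is sufficiently dissimilar from the last kept frame."""
    kept, last = [frames[0]], feats[0]
    for frame, f in zip(frames[1:], feats[1:]):
        cos = float(np.dot(last, f) / (np.linalg.norm(last) * np.linalg.norm(f) + 1e-8))
        if cos <= threshold:  # below threshold: the frame adds new information, keep it
            kept.append(frame)
            last = f
    return kept
```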
And 1.2, marking the extracted key frames by adopting lightweight graphic marking software Labelme.
The key postures are annotated with the minimum bounding rectangle of the target; the labels are Supine, Lift, Rise and Achieve, representing the start, upper body lifting, abdomen lifting and end actions respectively. The human posture key point labels are wrist, elbow, shoulder, hip, knee and ankle, representing the six posture key points of the wrists, elbows, shoulders, hips, knees and ankles.
Through this processing, 404 key posture annotations and 2424 key point annotations are obtained, and the labeling results are fed into the key posture estimation and key point detection network for training to obtain key posture and key point detection results.
Step 2, a key posture estimation and key point detection network is constructed based on the improved YOLOv8-Pose model, and the key frames extracted in step 1 are fed into it for posture estimation and key point detection.
Since the YOLOv8-Pose model performs excellently on key point detection, the invention improves upon the YOLOv8-Pose model for posture estimation and key point detection of sit-up actions; the model structure is shown in fig. 3.
The improved YOLOv8-Pose model includes a backbone network, a neck network, and a head network.
The backbone network introduces the global attention mechanism GAM so that the network attends more comprehensively to feature regions relevant to the human body, and effectively integrates context information across the channels of the feature map, improving the accuracy and robustness of human posture detection. Aggregated features are then formed at multiple scales by the spatial pyramid pooling layer, and the multi-scale attention mechanism EMA is introduced to adaptively emphasize the importance of features at different scales, improving the detection and localization of human skeleton key points.
The output of the backbone network is then fused and enhanced by the neck network; finally the detection head network makes the decision and generates the final detections, and the YOLOv8-Pose model outputs the key posture and key point detection results.
The backbone network includes a convolution module, a C2f module, a global attention mechanism GAM module, a spatial pyramid pooling layer, and a multi-scale attention mechanism EMA module, as shown in fig. 3. The processing flow of the signals in the backbone network is as follows:
The input key frame image sequence first passes through two convolution modules for image feature extraction, then a C2f module that captures complex features in the image; this is followed by another convolution module, another C2f module, and a further convolution module. The features then enter the global attention mechanism GAM, so that the network attends more comprehensively to feature regions relevant to the human body and context information is effectively integrated across the channels of the feature map. Aggregated features are then formed at multiple scales by the spatial pyramid pooling layer, and the multi-scale attention mechanism EMA adaptively emphasizes the importance of features at different scales to improve the detection and localization of human skeleton key points.
The specific structure of the global attention mechanism GAM module is shown in fig. 4. It effectively extracts feature information across channels and spatial dimensions while preserving the integrity of the original information.
The global attention mechanism GAM module includes a channel attention module MC and a spatial attention module MS.
The input feature F1 first passes through the channel attention module MC, which captures its important information along the channel dimension to produce F2; F2 then enters the spatial attention module MS, which further highlights highly relevant spatial regions and yields the output feature F3.
The process flow is expressed as follows:

F_2 = M_C(F_1) \otimes F_1, \qquad F_3 = M_S(F_2) \otimes F_2    (1)

where \otimes denotes element-wise multiplication, and M_C(\cdot) and M_S(\cdot) denote the channel attention process and the spatial attention process, respectively.
By introducing the GAM module, the YOLOv8-Pose network can attend more comprehensively to feature regions relevant to the human body and effectively integrate context information across the channels of the feature map, improving the accuracy and robustness of human posture detection.
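For readers reproducing the structure, a GAM block can be sketched in PyTorch roughly as below, following the published GAM design (channel attention as an MLP over permuted features, spatial attention as two 7×7 convolutions); the reduction ratio of 4 is an assumption, since the patent does not list layer hyperparameters.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Global Attention Mechanism: channel attention M_C followed by spatial attention M_S."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        self.channel_mlp = nn.Sequential(   # M_C: MLP applied along the channel dimension
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels))
        self.spatial = nn.Sequential(       # M_S: two 7x7 convolutions
            nn.Conv2d(channels, hidden, 7, padding=3), nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 7, padding=3), nn.BatchNorm2d(channels))

    def forward(self, f1: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f1.shape
        # channel attention: operate on (b, h*w, c), then restore the (b, c, h, w) layout
        att = self.channel_mlp(f1.permute(0, 2, 3, 1).reshape(b, -1, c))
        mc = torch.sigmoid(att.reshape(b, h, w, c).permute(0, 3, 1, 2))
        f2 = mc * f1                        # F2 = M_C(F1) element-wise F1
        ms = torch.sigmoid(self.spatial(f2))
        return ms * f2                      # F3 = M_S(F2) element-wise F2
```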
The specific structure of the multi-scale attention mechanism EMA module is shown in fig. 5. A parallel substructure is adopted to reduce the network depth, so that better pixel-level attention is generated for high-level feature maps without reducing the channel dimension, improving multi-dimensional perception and feature extraction. The processing flow of a signal in the EMA module is as follows:
The input aggregated features are first divided into several sub-features to form feature groups, which are processed through three parallel paths in two branches. The first branch is the 1×1 branch and contains two parallel paths: one-dimensional horizontal global pooling and one-dimensional vertical pooling encode the feature groups along the two spatial directions; the two encoded features are then concatenated and passed through a shared 1×1 convolution, whose output is decomposed into two vectors, each passing through a nonlinear Sigmoid activation function; these two vectors re-weight the feature groups, which are then group-normalized, and finally the features are reshaped by average pooling and normalized with a Softmax function. The second branch is the 3×3 branch: the feature groups capture local cross-channel interactions through a 3×3 convolution to expand the feature space. The features reshaped by average pooling and Softmax-normalized in this branch are matrix-multiplied (Matmul) with the group-normalized features of the first branch to obtain a first feature matrix; meanwhile, the 3×3-convolved features are matrix-multiplied with the features normalized by the first branch's Softmax function to obtain a second feature matrix. The two feature matrices are added, and an attention weight matrix is generated through a Sigmoid activation function. Finally, the input feature groups are re-weighted by the attention weight matrix to obtain the output features optimized by the EMA attention mechanism.
By introducing the EMA module, the YOLOv8-Pose network can adaptively emphasize the importance of features at different scales, improving the detection and localization of human skeleton key points.
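Likewise, a compact PyTorch rendering of the EMA block described above, based on the published Efficient Multi-scale Attention design; the group count of 8 is an assumption.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-scale Attention with a 1x1 branch and a 3x3 branch."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.groups = groups
        c = channels // groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d(1)              # average pooling for reshaping
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # 1-D horizontal global pooling
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # 1-D vertical global pooling
        self.gn = nn.GroupNorm(c, c)                    # grouping normalization
        self.conv1x1 = nn.Conv2d(c, c, 1)               # shared 1x1 convolution
        self.conv3x3 = nn.Conv2d(c, c, 3, padding=1)    # 3x3 branch convolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = x.reshape(b * self.groups, -1, h, w)        # split into sub-feature groups
        x_h = self.pool_h(g)                            # encode along height
        x_w = self.pool_w(g).permute(0, 1, 3, 2)        # encode along width
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2)) # concatenate + shared 1x1 conv
        x_h, x_w = torch.split(hw, [h, w], dim=2)       # decompose into two vectors
        x1 = self.gn(g * x_h.sigmoid() * x_w.sigmoid().permute(0, 1, 3, 2))
        x2 = self.conv3x3(g)                            # local cross-channel interaction
        # cross-branch attention: Matmul of pooled, Softmax-normalized descriptors
        m1 = torch.matmul(self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1)),
                          x2.reshape(b * self.groups, -1, h * w))
        m2 = torch.matmul(self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1)),
                          x1.reshape(b * self.groups, -1, h * w))
        weights = (m1 + m2).reshape(b * self.groups, 1, h, w)
        return (g * weights.sigmoid()).reshape(b, c, h, w)  # re-weighted output
```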
Considering the action characteristics of sit-ups and the actual test environment, the invention proposes a Situp-PoseNet model based on the YOLOv8-Pose network: a global attention mechanism (Global Attention Mechanism, GAM) and an efficient multi-scale attention mechanism (Efficient Multi-Scale Attention, EMA) are integrated into the YOLOv8-Pose network, effectively improving human posture detection and skeleton key point localization, so that the evaluation result of the scoring network is more accurate.
Step 3, finally, the key point data obtained by the key posture estimation and key point detection network are sent to a score evaluation network for weighted scoring of key action points, completing the quality assessment of the sit-up action.
The invention designs a score evaluation network, with scoring standards and weighted scoring for completion quality, based on video analysis and understanding of the sit-up action. Specifically, 4 key stages (postures) in the sit-up process are selected as evaluation objects and given different weights. Each key posture has 3 action points, namely holding the head with both hands, bending the knees at 90° and folding the body, and each action point also has its own weight. Action completion quality and non-standard actions are evaluated according to the action points of each stage, thereby completing the weighted scoring of sit-up action quality.
Compared with the existing manual scoring strategy, which is highly subjective, and with methods that focus only on action counting and easily neglect action quality, the proposed scoring strategy emphasizes the assessment of action staging, consistency and standardization.
By subdividing key postures and action points and assigning them different weights, the method can more reasonably reflect the accuracy and standardization of the actions, help learners better understand the correct postures and action points of sit-ups, and lead them to pay more attention to action quality and standardization rather than merely pursuing quantity, achieving more scientific and fair scoring.
The score evaluation network structure is shown in fig. 6, and the overall processing thought is as follows:
After the key posture estimation and key point detection network obtains the positions and corresponding coordinates of each key point in the sit-up key frames, they are sent to the scoring network, which calculates the key point angles, assigns the weight coefficients and judges the deduction items; the final action quality prediction score is then obtained by weighted summation, completing the quality assessment of the sit-up action.
According to the 1-minute sit-up test standard, to score the completion quality of the sit-up action, 4 key stages (postures) in the sit-up process are selected as evaluation objects, namely:
the sit-up starting posture P1, in which the subject lies supine on a mat, both shoulder blades touching the mat, knees bent at about 90°, and both hands holding the head; the upper body lifting posture P2, in which the subject's upper body leaves the mat with the hands still holding the head; the abdomen lifting posture P3, in which the subject completes the sitting-up stage using abdominal force, hands still holding the head; and the sit-up ending posture P4, in which the subject sits up with both elbows touching or passing the knees.
The standard attitudes of P1 to P4 are schematically shown in fig. 7. Wherein P1 is the standard posture at the beginning of the test, P2 is the standard posture of the upper body lifting, P3 is the standard posture of the abdomen lifting, and P4 is the standard posture of the sit-up ending.
The motion quality of sit-ups is automatically scored based on video analysis and understanding, and the positions of skeleton key points of the 4 key stages need to be accurately tracked and matched with standard gestures to obtain scores.
As shown in (a) of fig. 8, six key points A to F in the video action skeleton are selected as tracking targets for pose estimation: point A is at the shoulder joint, with coordinates (x_1, y_1); point B is at the hip joint (x_2, y_2); point C is at the knee joint (x_3, y_3); point D is at the elbow joint (x_4, y_4); point E is at the wrist joint (x_5, y_5); and point F is at the ankle joint (x_6, y_6). Fig. 8 (b) is an example frame of the P3 stage with the region of interest marked by a rectangular box, and fig. 8 (c) shows the 6 key points captured from that frame.
To evaluate the completion quality of each stage, judgment is made according to the importance of each key posture and its completion quality. First, the 4 key stages are given different weights: 0.3 for P1, 0.2 for P2, 0.2 for P3 and 0.3 for P4. Action completion quality and non-standard actions are then evaluated according to the action points of each stage. The evaluation standard is as follows: the degrees of holding the head with both hands, bending the knees at 90° and folding the body are scored according to the angle matching degree of the corresponding key points.
The specific weights and scores for each action point are shown in table 1.
Table 1 sit-up action scoring method based on key point angles
Specifically, the process of evaluating action completion quality and non-standard actions is as follows:
The standardization of the head-holding action is defined and judged by the arm folding angle D: when D ≤ 45°, the action is considered a standard head-holding action and scores 5;
when D > 45°, scores of 4, 3, 2, 1 or 0 are given according to the degree of non-standardness.
The angle D is the angle at the elbow key point D between the shoulder key point A and the wrist key point E:

D = \arccos\frac{(x_1-x_4)(x_5-x_4)+(y_1-y_4)(y_5-y_4)}{\sqrt{(x_1-x_4)^2+(y_1-y_4)^2}\,\sqrt{(x_5-x_4)^2+(y_5-y_4)^2}}    (2)
Whether the subject's knees are bent at 90° is judged by the leg angle C: when 80° < C < 95°, the action is considered a standard knee-bending action and scores 5;
when C is outside this range, scores of 4, 3, 2, 1 or 0 are given according to the degree of non-standardness.
The angle C is the angle at the knee key point C between the hip key point B and the ankle key point F:

C = \arccos\frac{(x_2-x_3)(x_6-x_3)+(y_2-y_3)(y_6-y_3)}{\sqrt{(x_2-x_3)^2+(y_2-y_3)^2}\,\sqrt{(x_6-x_3)^2+(y_6-y_3)^2}}    (3)
Whether the subject is lying supine is judged by the degree of body folding: when the body folding angle B ≥ 120°, the action is considered a standard supine action and scores 5;
when B < 120°, scores of 4, 3, 2, 1 or 0 are given according to the degree of non-standardness.
The standardization of the sitting-up action is judged by the angle B': when the body folding angle B' ≤ 60° and the elbows touch or pass the knees, the score is 5;
when B' > 60°, scores of 4, 3, 2, 1 or 0 are given according to the degree of non-standardness.
The angles B and B' are calculated by the same formula, the angle at the hip key point B between the shoulder key point A and the knee key point C, as shown in formula (4):

B = \arccos\frac{(x_1-x_2)(x_3-x_2)+(y_1-y_2)(y_3-y_2)}{\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}\,\sqrt{(x_3-x_2)^2+(y_3-y_2)^2}}    (4)
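Formulas (2)-(4) are all instances of the same vertex-angle computation, which can be written once in Python; in this sketch the helper name and the example coordinates are illustrative.

```python
import math

def joint_angle(vertex: tuple, p1: tuple, p2: tuple) -> float:
    """Angle in degrees at `vertex` between the rays vertex->p1 and vertex->p2."""
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp for numerical safety
    return math.degrees(math.acos(cos))

# angle D (arm fold):    vertex = elbow D, rays to shoulder A and wrist E
# angle C (knee bend):   vertex = knee C,  rays to hip B and ankle F
# angle B/B' (body fold): vertex = hip B,  rays to shoulder A and knee C
angle_D = joint_angle((140, 80), (120, 60), (150, 95))  # example coordinates
```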
In summary, based on the 4 key stages P_M of the sit-up video and the 3 key action points P_MN of each stage, namely holding the head with both hands, bending the knees at 90° and folding the body, the final prediction score is:

S_P = \sum_{M=1}^{4} \beta_M \sum_{N=1}^{3} \alpha_{MN} S_{MN}    (5)

where S_P is the final prediction score, S_MN is the key-point-angle score of action point N in stage M, \alpha_{MN} is the weight of each stage's key action point, \beta_M is the weight of each key stage, M is the key stage number, and N is the key action point number.
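Formula (5) translates directly into code. In the sketch below the stage weights follow the values given above (0.3, 0.2, 0.2, 0.3), while the per-action-point weights and scores stand in for Table 1, whose numeric content is not reproduced here; the equal weights are therefore an assumption.

```python
BETA = [0.3, 0.2, 0.2, 0.3]  # stage weights beta_M for P1..P4, from the description

def predict_score(point_scores: list, alpha: list) -> float:
    """Implements formula (5): S_P = sum_M beta_M * sum_N alpha_MN * S_MN.
    point_scores[M][N]: score (0-5) of action point N in stage M;
    alpha[M][N]: weight of that action point."""
    return sum(
        BETA[m] * sum(a * s for a, s in zip(alpha[m], point_scores[m]))
        for m in range(4)
    )

# Example: equal action-point weights within each stage (assumed, Table 1 not reproduced)
alpha = [[1 / 3, 1 / 3, 1 / 3] for _ in range(4)]
scores = [[5, 4, 5], [5, 5, 4], [4, 5, 5], [5, 5, 5]]
print(predict_score(scores, alpha))  # weighted quality score of one complete sit-up
```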
There is no published sit-up action video dataset, so to verify the effectiveness of the proposed method the sit-up video dataset SDUST-Situp was built. The dataset comprises 108 mp4 videos in 32 groups, each clip about 2-3 seconds long, captured with an iPhone 13 Pro. The sit-up test environment was arranged according to the "1-minute sit-up" test standard. For each volunteer, the complete sit-up motion from start to end was captured. Videos with significant blur, jitter or poor illumination were removed, as were still segments at the beginning and end. Video resolution and frame rate were normalized to 720×720 at 30 fps.
The training and validation sets are split 4:1, i.e., 324 frames for training and 80 frames for validation.
For the sit-up action posture detection and key point detection tasks, mAP is used as the evaluation index, namely mAP@0.5 and mAP@0.5-0.95. mAP@0.5 denotes the average precision at IoU = 0.5, and mAP@0.5-0.95 the average precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Wherein IoU (Intersection over Union) denotes the degree of overlap between the predicted value and the actual value.
The key point detection precision uses mAP_Pose as the evaluation index, calculated as:

mAP_{Pose} = \frac{1}{N}\sum_{i=1}^{N} AP_{Pose,i}    (6)

where N represents the number of key point categories.
AP_Pose represents the average precision of key point detection, calculated over all detected persons p as:

AP_{Pose} = \frac{\sum_{p} \beta_p}{\sum_{p} 1}    (7)

where \beta_p is determined by the OKS threshold rule described below.
OKS (object keypoint similarity) denotes the similarity between the true key points and the predicted key points, calculated by:

OKS_p = \frac{\sum_i \exp\left(-d_{pi}^2 / (2 S_p^2 \sigma_i^2)\right)\,\delta(v_{pi}>0)}{\sum_i \delta(v_{pi}>0)}    (8)
where d_pi is the Euclidean distance between the detected position of the i-th key point and its true position, S_p is the scale factor of person p, and v_pi is the visibility of the key point: 0 means unlabeled, 1 means labeled but occluded, and 2 means labeled and visible.
\sigma_i is the normalization factor for key points of type i, and \delta(v_{pi}>0) is an indicator equal to 1 when v_pi > 0 and 0 otherwise.
T is a given threshold: when OKS_p > T, \beta_p takes the value OKS_p; otherwise \beta_p is 0.
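The OKS of formula (8) can be sketched with NumPy as below; the per-keypoint normalization factors sigma are not listed in the patent, so any concrete values (for example the COCO defaults for the six joints) would be an assumption.

```python
import numpy as np

def oks(pred: np.ndarray, gt: np.ndarray, vis: np.ndarray, s: float, sigma: np.ndarray) -> float:
    """Object keypoint similarity of formula (8).
    pred, gt: (K, 2) predicted and true key point coordinates;
    vis: (K,) visibility flags v_pi; s: scale factor S_p; sigma: (K,) normalization factors."""
    d2 = np.sum((pred - gt) ** 2, axis=1)        # squared Euclidean distances d_pi^2
    mask = vis > 0                               # indicator delta(v_pi > 0)
    e = np.exp(-d2 / (2 * s ** 2 * sigma ** 2))
    return float(e[mask].sum() / max(mask.sum(), 1))
```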
The action posture detection precision uses mAP_Box as the evaluation index, calculated as:

mAP_{Box} = \frac{1}{M}\sum_{j=1}^{M} AP_{Box,j}    (9)

where M represents the number of action posture categories. AP_Box represents the average precision of action posture detection, obtained as the area under the P-R curve, which can be computed by integration:

AP_{Box} = \int_0^1 P(R)\,dR    (10)
where P denotes precision and R denotes recall, calculated as:

P = \frac{TP_s}{TP_s + FP_s}    (11)

R = \frac{TP_s}{TP_s + FN_s}    (12)

where s represents a certain action posture and non-s represents other states or actions; TP_s is the number of frames correctly classified as s, FP_s the number of frames misclassified as s, and FN_s the number of frames of s misclassified as non-s.
The prediction accuracy of the action quality score uses the mean square error (MSE) and the Spearman rank correlation coefficient \rho as evaluation indexes, reflecting the degree of correlation between the predicted score and the true score.
Let S_P^l denote the model's prediction score and S_G^l the expert-annotated score (range 0-5) for the l-th video. The mean square error MSE and the Spearman rank correlation coefficient \rho are calculated as:

MSE = \frac{1}{L}\sum_{l=1}^{L}\left(S_P^l - S_G^l\right)^2    (13)

\rho = \frac{\sum_l \left(R(S_P^l) - \overline{R(S_P)}\right)\left(R(S_G^l) - \overline{R(S_G)}\right)}{\sqrt{\sum_l \left(R(S_P^l) - \overline{R(S_P)}\right)^2 \sum_l \left(R(S_G^l) - \overline{R(S_G)}\right)^2}}    (14)

where L is the number of videos, S_P^l and S_G^l are the prediction score and true score of the l-th video, R(\cdot) denotes the rank of a score within its score series, and the superscript l is the video index.
\rho takes values in [-1, 1]; the higher the value, the better the score prediction performance of the network.
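Both metrics are available off the shelf, for instance with NumPy and SciPy; the score values below are illustrative, not from the experiments.

```python
import numpy as np
from scipy.stats import spearmanr

pred = np.array([4.8, 3.9, 4.9, 2.9, 4.4])  # model prediction scores (illustrative)
true = np.array([5.0, 4.0, 5.0, 3.0, 4.0])  # expert-annotated true scores

mse = float(np.mean((pred - true) ** 2))  # formula (13)
rho, _ = spearmanr(pred, true)            # formula (14): Spearman rank correlation
print(f"MSE={mse:.3f}, rho={rho:.3f}")
```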
All experiments were run on a 64-bit Windows 10 operating system with an Intel(R) Xeon(R) Silver 4210R CPU, 128 GB of RAM, and an NVIDIA RTX A6000 graphics card with 48 GB of video memory.
Programming was implemented in Python 3.8 with PyTorch-GPU 1.9.0 and CUDA 11.1. The input image size was 640×640, the batch size 32, the initial learning rate 0.01, the decay factor 0.0005, and the number of epochs 300; training was accelerated with an AdamW optimizer. All other parameters use the official YOLOv8-Pose defaults.
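Assuming the Ultralytics implementation of YOLOv8-Pose, the configuration above maps onto a training call like the one below; the dataset YAML is a placeholder, and the GAM/EMA-modified architecture would first have to be registered in a custom model definition (not shown).

```python
from ultralytics import YOLO

# Start from the official pose weights; the GAM/EMA-modified backbone would be
# declared in a custom model YAML and loaded here instead.
model = YOLO("yolov8n-pose.pt")
model.train(
    data="sdust_situp.yaml",  # hypothetical dataset config with the 6 key points
    imgsz=640,                # input image size 640x640
    batch=32,                 # batch size
    epochs=300,               # training epochs
    optimizer="AdamW",        # AdamW optimizer as in the experimental setup
    lr0=0.01,                 # initial learning rate
    weight_decay=0.0005,      # decay factor
)
```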
The invention provides an improved key posture estimation and key point detection network based on the YOLOv8-Pose network, which performs action posture estimation and key point detection on key action frames of the video and applies weighted scoring of sit-up action quality through key angle characteristics; the overall model is named Situp-PoseNet. Table 2 shows the performance of the proposed Situp-PoseNet and other SOTA methods (PEPoseNet, ST-GCN, GDLT) on the SDUST-Situp dataset. PEPoseNet and ST-GCN are skeleton-key-point-based models, while GDLT is a Transformer framework operating on video streams.
For the comparison algorithms, no change was made except that the final scoring module was modified to fit the action quality assessment of sit-ups. In both posture estimation and key point detection performance and action quality score prediction performance, Situp-PoseNet shows clear advantages: the mean square error MSE and the Spearman rank correlation coefficient \rho reach 0.017 and 0.933 respectively, indicating a marked improvement in capturing the spatio-temporal correlations of sit-up key points. Compared with the YOLOv8-Pose reference model, the Spearman rank correlation coefficient improves by 7.6%, and the prediction-score mean square error decreases by 4.3%.
TABLE 2 Performance of quality of action assessment for different models on SDUST-Situp datasets
The PEPoseNet model incorporates dataset features with pull-up human joint coordinates and sports equipment key point coordinates during training, but this training approach may not transfer fully to sit-up action feature detection. Its key point estimation performance is therefore relatively low, with mAP_Pose@0.5 of only 83.1%, so its action quality assessment performance is somewhat lacking. In contrast, the ST-GCN model automatically learns spatial and temporal features from the data, giving it stronger feature expression and generalization capabilities and making it superior to PEPoseNet in key point estimation. However, ST-GCN may suffer from redundant key points, which can affect the scoring network's calculation, so its score prediction performance is relatively poor. GDLT extracts features directly from the video stream, so the motion features are easily affected by background factors such as the environment, and its action quality score prediction performance is average. GDLT has the worst mean square error of the predicted scores, but not the worst Spearman rank correlation coefficient (0.753), which indicates that its ranking of the predicted score series is better than its per-score error. Although PEPoseNet, ST-GCN and GDLT each have strengths and weaknesses in key point estimation and action quality assessment, all of them have some reference value in practical applications.
Tables 3 and 4 list the ablation results for posture detection and key point estimation and for the action quality prediction score, respectively. To verify the effect of adding the GAM attention mechanism and the EMA module on YOLOv8-Pose posture detection and key point estimation accuracy, ablation tests were conducted. The reference model YOLOv8-Pose has the weakest posture detection and key point estimation: mAP_Box@0.5 and mAP_Box@0.5-0.95 average 94.8% and 91.4%, and mAP_Pose@0.5 and mAP_Pose@0.5-0.95 average 94.8% and 93.5%, respectively. This is because the reference model is trained on the COCO dataset annotated with 17 human key points; for sit-up actions, its posture detection can be redundant and the detected key points more cluttered. After adding the GAM attention mechanism to the reference model, the model focuses more on the human posture regions of interest, and the posture detection and key point estimation metrics improve by 0.1%, 2.2%, 0.1% and 1.4% respectively. After adding the EMA module to the reference model, the model focuses on skeleton key point features, improving key point estimation and also posture detection, with gains of 0.6%, 1.7%, 0.6% and 0.7% respectively. Finally, adding the GAM and EMA modules together lets the model further improve key point estimation on the basis of extracting the human posture region of interest, with the metrics improving by 1.2%, 3%, 1% and 2.2% respectively.
TABLE 3 results of ablation experiments for gesture detection and keypoint estimation
Meanwhile, the influence of adding the GAM attention mechanism and the EMA module to the YOLOv8-Pose model on action quality scoring performance was also compared, with results shown in table 4. After adding the GAM module to the YOLOv8-Pose reference model, the model attends more to the human action posture region of interest and the feature region shrinks, improving key point extraction; the score prediction improves, with MSE reduced by 3.2% and \rho improved by 5.6%. After adding the EMA module to the reference model, the model focuses on human skeleton key point features, strengthening feature extraction; the score prediction improves, with MSE reduced by 1.1% and \rho improved by 6.6%. Adding the GAM and EMA modules together, the two complement each other: the model further improves key point estimation on the basis of extracting the region of interest, effectively improving the accuracy of score estimation, with the Spearman rank correlation coefficient \rho reaching 0.933.
Table 4 results of ablation experiments for motion quality prediction scores
Experimental results show that the proposed Situp-PoseNet model can accurately score the quality of a single complete sit-up action, with the Spearman rank correlation coefficient reaching 0.933 and the mean square error 0.017. Adding the GAM module and the EMA attention mechanism to YOLOv8-Pose effectively improves the accuracy of posture recognition and key point localization, demonstrating the effectiveness of the improvement.
The number of video frames and the frame size have a direct impact on computational resources and action quality assessment performance. Provided assessment performance is maintained, reducing the frame number and frame size as much as possible lowers the computational cost and facilitates real-time detection on portable devices. Table 5 lists the experimental results for different frame numbers and frame sizes on the SDUST-Situp dataset.
Table 5 SDUST-Situp evaluation Performance of datasets of different frame numbers and frame sizes
As can be seen from table 5, the effect of frame size is larger than that of frame number. A larger frame size means higher resolution and more detail, and hence better score prediction performance. For example, with 4 frames, a frame size of 320×320 gives MSE and \rho of 0.082 and 0.878, while 1024×1024 gives 0.016 and 0.937, a marked improvement; a similar trend holds for 8 frames. However, the computation time increases from 3.92 seconds to 8.02 seconds. The choice of frame number and resolution therefore requires a trade-off; from the data in the table, the combination of 4 frames at 640×640 can be considered optimal.
The method of the invention provides an improved key posture estimation and key point detection network based on the YOLOv8-Pose network, performs action posture estimation and key point detection on key action frames of the video, greatly improves these capabilities, effectively improves human posture detection and skeleton key point localization, and makes the action quality prediction score of the scoring network more accurate.
The foregoing description is, of course, merely illustrative of preferred embodiments of the present invention, and it should be understood that the present invention is not limited to the above-described embodiments, but is intended to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

Claims (10)

The input key frame image sequence first passes through two convolution modules for image feature extraction, then a C2f module that captures complex features in the image; this is followed by another convolution module, another C2f module, and a further convolution module. The features then enter the global attention mechanism GAM, so that the network attends more comprehensively to feature regions relevant to the human body and context information is effectively integrated across the channels of the feature map. Aggregated features are then formed at multiple scales by the spatial pyramid pooling layer, and the multi-scale attention mechanism EMA adaptively emphasizes the importance of features at different scales to improve the detection and localization of human skeleton key points.
The input aggregated features are first divided into several sub-features to form feature groups, which are processed through three parallel paths in two branches. The first branch is the 1×1 branch and contains two parallel paths: one-dimensional horizontal global pooling and one-dimensional vertical pooling encode the feature groups along the two spatial directions; the two encoded features are then concatenated and passed through a shared 1×1 convolution, whose output is decomposed into two vectors, each passing through a nonlinear Sigmoid activation function; these two vectors re-weight the feature groups, which are then group-normalized, and finally the features are reshaped by average pooling and normalized with a Softmax function. The second branch is the 3×3 branch: the feature groups capture local cross-channel interactions through a 3×3 convolution to expand the feature space. The features reshaped by average pooling and Softmax-normalized in this branch are matrix-multiplied (Matmul) with the group-normalized features of the first branch to obtain a first feature matrix; meanwhile, the 3×3-convolved features are matrix-multiplied with the features normalized by the first branch's Softmax function to obtain a second feature matrix. The two feature matrices are added, and an attention weight matrix is generated through a Sigmoid activation function. Finally, the input feature groups are re-weighted by the attention weight matrix to obtain the output features optimized by the EMA attention mechanism.
8. The automatic sit-up action quality assessment method based on posture key points according to claim 1, wherein in step 3, 4 key stages in the sit-up process are selected as evaluation objects for scoring the completion quality of the sit-up action: the sit-up starting posture P1, in which the subject lies supine on a mat, both shoulder blades touching the mat, knees bent at about 90°, and both hands holding the head; the upper body lifting posture P2, in which the subject's upper body leaves the mat with the hands still holding the head; the abdomen lifting posture P3, in which the subject completes the sitting-up stage using abdominal force, hands still holding the head; and the sit-up ending posture P4, in which the subject sits up with both elbows touching or passing the knees.
CN202411161998.3A — filed 2024-08-23 — Automatic evaluation method of sit-up action quality based on posture key points — Active — granted as CN118968629B (en)

Priority Applications (1)

Application Number: CN202411161998.3A
Priority Date / Filing Date: 2024-08-23
Title: Automatic evaluation method of sit-up action quality based on posture key points

Applications Claiming Priority (1)

Application Number: CN202411161998.3A
Priority Date / Filing Date: 2024-08-23
Title: Automatic evaluation method of sit-up action quality based on posture key points

Publications (2)

Publication Number — Publication Date
CN118968629A — 2024-11-15
CN118968629B (en) — 2025-03-14

Family

Family ID: 93407436

Family Applications (1)

Application Number — Title — Priority Date — Filing Date
CN202411161998.3A — Automatic evaluation method of sit-up action quality based on posture key points (Active, granted as CN118968629B (en)) — 2024-08-23 — 2024-08-23

Country Status (1)

Country — Link
CN (1) — CN118968629B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
CN115019338A (en)* — 2022-04-27 — 2022-09-06 — 淮阴工学院 — A multi-person pose estimation method and system based on GAMHR-Net
CN115171208A (en)* — 2022-05-31 — 2022-10-11 — 中科海微(北京)科技有限公司 — Sit-up posture evaluation method and device, electronic equipment and storage medium
CN115953834A (en)* — 2022-12-16 — 2023-04-11 — 重庆邮电大学 — Multi-head attention posture estimation method and detection system for sit-up
CN116453216A (en)* — 2023-03-28 — 2023-07-18 — 深圳市菲普莱体育发展有限公司 — Sit-up detection method, apparatus, device, and computer-readable storage medium
CN116563946A (en)* — 2023-05-12 — 2023-08-08 — 宁波愉阅网络科技有限公司 — A system and method for evaluating student sports training based on artificial intelligence
CN117095457A (en)* — 2023-08-03 — 2023-11-21 — 中山大学 — Digital person reconstruction motion scoring method, system, equipment and medium
CN117315770A (en)* — 2023-08-04 — 2023-12-29 — 深圳大学 — Human behavior recognition method, device and storage medium based on skeleton points
CN117854107A (en)* — 2023-12-29 — 2024-04-09 — 上海网达软件股份有限公司 — Human body picture key point detection method, device and storage medium based on improved YOLOv8

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number — Priority date — Publication date — Assignee — Title
CN119693840A (en)* — 2024-11-21 — 2025-03-25 — 南开大学 — A multi-person sports event counting method, system and medium based on multi-key point detection
CN119580016A (en)* — 2025-02-06 — 2025-03-07 — 天津市天河计算机技术有限公司 — Mineral identification method, device and storage medium
CN119580016B (en)* — 2025-02-06 — 2025-06-03 — 天津市天河计算机技术有限公司 — Mineral identification method, device and storage medium

Also Published As

Publication number — Publication date
CN118968629B (en) — 2025-03-14

Similar Documents

Publication — Title
CN118968629B (en) — Automatic evaluation method of sit-up action quality based on posture key points
Li et al. — [Retracted] Intelligent Sports Training System Based on Artificial Intelligence and Big Data
CN108734104B (en) — Body-building action error correction method and system based on deep learning image recognition
WO2021057810A1 (en) — Data processing method, data training method, data identifying method and device, and storage medium
CN113762133A (en) — Self-weight fitness auxiliary coaching system, method and terminal based on human body posture recognition
CN114170537B (en) — A multimodal three-dimensional visual attention prediction method and its application
CN110298279A (en) — A kind of limb rehabilitation training householder method and system, medium, equipment
CN119068558B (en) — Deep learning-based athlete throwing action analysis and training method and equipment
CN103310191B (en) — The human motion recognition method of movable information image conversion
CN112464915A (en) — Push-up counting method based on human body bone point detection
CN110991268A (en) — A method and system for quantitative analysis of Parkinson's hand motion based on depth image
CN117058758B (en) — Intelligent sports examination method based on AI technology and related device
CN111833439A (en) — Artificial intelligence-based ammunition throwing analysis and mobile simulation training method
CN115953834A (en) — Multi-head attention posture estimation method and detection system for sit-up
CN111046715A (en) — Human body action comparison analysis method based on image retrieval
CN115497170B (en) — A method for identifying and scoring queue-style skydiving training actions
CN115909225B (en) — A ship detection method based on OL-YoloV5 online learning
CN114092971B (en) — A human motion assessment method based on visual images
Zeng et al. — Machine learning based automatic sport event detection and counting
CN113378772B (en) — Finger flexible detection method based on multi-feature fusion
CN117373109A (en) — Posture assessment method based on human skeleton points and action recognition
CN116343332A (en) — Intelligent table tennis training method and system thereof
Zhang et al. — Intelligent Pose Recognition and Evaluation System for Rowing Sports
CHEN et al. — Action Recognition Method of Basketball Training Based on Big Data Technology
Moodley et al. — I3D-AE-LSTM: A 2-Stream autoencoder for action quality assessment using a newly created cricket batsman video dataset

Legal Events

Code — Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant
