Disclosure of Invention
(I) Technical Problems to Be Solved
In order to solve the problems in the background art, the invention provides a system for real-time identification of sports events and automatic generation of review videos, which has the advantages of real-time analysis of video pictures, real-time updating of score statistics and real-time superposition of playback videos, and solves the problem that conventional devices cannot meet the requirements of modern sports events for fast, accurate and comprehensive review video generation.
(II) Technical Scheme
In order to achieve the above purpose, the invention provides the following technical scheme: a system for real-time identification of sports events and automatic review generation, comprising:
the acquisition module, which acquires referee and player video stream data in real time;
the AI intelligent recognition and statistics module comprises a processing unit and a statistics unit;
The processing unit is coupled with the acquisition module, receives the video stream data, performs image processing, gesture separation, feature extraction and pattern matching, and automatically identifies gesture types in the video stream;
The statistics unit is coupled with the processing unit, calculates scores according to gesture recognition results, and records and updates score or punishment information of the scoring party and the offender;
The system further comprises:
the video storage and processing module, coupled to the AI intelligent recognition and statistics module, is used for automatically saving video stream segments and generating a playback video of duration X when a scoring or violation gesture type is recognized;
a video synthesis unit, which synthesizes the live picture and the generated playback video into a single superimposed video picture;
and the video output and display module is used for outputting the video picture to display equipment for display.
In the above technical solution, preferably, the processing unit includes:
The preprocessing module receives the video stream data and obtains an image set through image denoising, image conversion and image enhancement processing respectively;
the gesture detection and separation module is used for identifying, according to a preset skin color threshold range, the regions in the image set where gestures fall within the threshold range, and separating the gestures from the background to obtain a gesture image set;
The feature extraction and identification module is used for receiving the gesture image set, identifying and extracting a gesture feature set, and extracting key points of the detected gesture according to the extracted gesture feature set to obtain a geometric feature image set;
And the gesture analysis module is used for identifying and analyzing gesture types by matching the extracted geometric feature image set against predefined action and gesture patterns.
In the above technical solution, preferably, the specific steps of performing image denoising, image conversion and image enhancement processing by the preprocessing module to obtain an image set include:
Image denoising, namely processing image pixels by adopting a filtering algorithm to remove image noise;
image conversion, namely converting the image with the image noise removed from the BGR color space to the HSV color space;
Image enhancement, namely adjusting the gray scale distribution of the image or enhancing the contrast in the image for the converted image.
In the above technical solution, preferably, the filtering algorithm includes gaussian filtering and median filtering, and the image enhancement includes histogram equalization and contrast enhancement.
In the above technical solution, preferably, the step in which the gesture detection and separation module identifies and separates gestures to obtain the gesture image set specifically includes:
skin detection, namely identifying pixels conforming to a preset skin color threshold range in an image, and carrying out morphological operation on the identified skin pixels to optimize the boundary of a skin region;
Gesture segmentation, namely separating gestures from the background through an image segmentation algorithm;
The image segmentation algorithm includes threshold segmentation, edge detection, and morphological operations.
In the above technical solution, preferably, the step of identifying and extracting the gesture feature set by the feature extraction and identification module specifically includes:
Extracting the gesture feature set, namely extracting features of the gesture image set by using a deep learning model to generate a feature map containing human body structure information;
In particular, the deep learning model includes a Convolutional Neural Network (CNN);
the Convolutional Neural Network (CNN) feature extraction process is as follows:
The preprocessed gesture image set is input into a Convolutional Neural Network (CNN) model;
The image data is passed layer by layer through convolution layers, activation functions and pooling layers, undergoing convolution, activation and pooling in turn, and is converted into a feature map containing key gesture information;
After repeated convolution and pooling, the feature map is input into a fully connected layer, which maps it into the sample label space for classification or regression, yielding a feature map containing human body structure information.
In the foregoing technical solution, preferably, the Convolutional Neural Network (CNN) model includes:
At least one convolution layer for extracting local features from an input image;
an activation function, connected after each convolution layer;
At least one pooling layer for reducing the dimension of the feature map and preserving important features;
At least one fully connected layer, located after the convolution layer and the pooling layer, for mapping the learned feature representation to a sample label space;
Wherein the convolution layer comprises a plurality of convolution kernels, each convolution kernel performing a convolution operation on the input image through a sliding window;
The activation function includes a ReLU function;
the pooling layer comprises a maximum pooling operation;
and the fully connected layer flattens the feature map into a one-dimensional vector and outputs the image classification result.
In the above technical solution, preferably, the step in which the feature extraction and identification module extracts key points of the detected gesture to obtain the geometric feature image set specifically includes:
Extracting a geometric feature image set, namely detecting key points of gestures by using a deep learning model based on the extracted feature image, and generating the geometric feature image set containing key point information;
specifically, the deep learning model includes the OpenPose model;
the OpenPose model extraction process is as follows:
The preprocessed gesture feature set is input into the OpenPose model;
Predicting the positions of key points of a human body in an image through a feature map and a confidence map for an input gesture feature set, analyzing the relation between the key points through a part affinity field, and detecting the connection and arrangement of all parts and limbs of the human body;
a set of geometric feature images containing keypoint information is generated.
In the above technical solution, preferably, the step in which the gesture analysis module identifies and analyzes gesture types specifically includes:
matching the key points to corresponding gesture structures to form complete gesture postures, and recognizing the gestures as scoring or violation gestures;
Matching the referee/players to front-facing or back-facing orientations through facial features in the gesture image;
the left hand position and the right hand position of the referee gesture are matched and distinguished through the relative position characteristics of the hand and the body in the gesture image, and a left hand judgment result and a right hand judgment result are output;
and determining the number of fingers shown by the referee through matching, according to the relative positions of the finger joints in the gesture image.
(III) Beneficial Effects
Compared with the prior art, the invention has the following beneficial effects:
Through the AI intelligent recognition and statistics module, the invention can accurately identify key information such as referee actions and player actions, and further record in real time information such as the scoring party or offender and the number of scores or violations, achieving automatic real-time identification and score statistics;
By analyzing the video stream pictures in real time, referee actions, player actions and the like are automatically identified, so that the timing of the trigger condition can be captured accurately and the review video can be generated in real time. This meets the requirement for quick response during a competition, greatly improves generation efficiency, and reduces manual intervention and error. Referees can make accurate judgments in a timely manner during the competition, which improves the fairness and watchability of the event and avoids disputes caused by misjudgment or omission;
Meanwhile, when the review video is automatically generated, the highlight moment video can be superimposed in real time onto the live broadcast or rebroadcast picture, providing the audience with a more intuitive and vivid viewing experience and increasing the watchability of the competition.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in figs. 1 to 3, the present invention provides a system for real-time identification of sports events and automatic review generation, which is applicable to various scenarios such as jiu-jitsu matches, ball games, combat sports, athletics and other fields, and specifically comprises:
the acquisition module acquires referee and player video stream data in real time;
The acquisition module comprises a plurality of cameras: one camera is dedicated to following the referee and capturing the referee's gesture actions in real time, while the other cameras follow the players and capture real-time pictures of the competition field; each camera is connected to the AI intelligent recognition and statistics module through a network or a data line to realize synchronous transmission of video data;
the AI intelligent recognition and statistics module comprises a processing unit and a statistics unit;
The processing unit is coupled with the acquisition module, receives the video stream data, performs image processing, gesture separation, feature extraction and pattern matching, and automatically identifies gesture types in the video stream;
the processing unit includes:
The preprocessing module receives the video stream data and obtains an image set through image denoising, image conversion and image enhancement processing respectively;
Preferably, the specific steps of performing image denoising, image conversion and image enhancement processing by the preprocessing module to obtain an image set include:
Image denoising, namely processing image pixels by adopting a filtering algorithm to remove image noise;
the filtering algorithm comprises Gaussian filtering and median filtering, and these filtering operations can improve the image quality;
image conversion, namely converting the image with the image noise removed from the BGR color space to the HSV color space, which is convenient for subsequent skin detection and gesture positioning.
Image enhancement, namely adjusting the gray scale distribution of the image or enhancing the contrast in the image for the converted image.
Image enhancement comprises histogram equalization and contrast enhancement, so that the image contrast can be improved, and gestures are clearer.
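As an illustration of the preprocessing flow described above, the following is a minimal sketch using OpenCV in Python; the filter kernel sizes and the choice of equalizing only the brightness (V) channel are illustrative assumptions rather than values fixed by the invention:

```python
import cv2

def preprocess_frame(frame_bgr):
    # Image denoising: Gaussian filtering followed by median filtering
    # (5x5 kernels are assumed values)
    denoised = cv2.GaussianBlur(frame_bgr, (5, 5), 0)
    denoised = cv2.medianBlur(denoised, 5)
    # Image conversion: BGR color space -> HSV color space
    hsv = cv2.cvtColor(denoised, cv2.COLOR_BGR2HSV)
    # Image enhancement: histogram equalization on the brightness channel
    h, s, v = cv2.split(hsv)
    v = cv2.equalizeHist(v)
    return cv2.merge((h, s, v))
```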
The gesture detection and separation module is used for identifying, according to a preset skin color threshold range, the regions in the image set where gestures fall within the threshold range, and separating the gestures from the background to obtain a gesture image set;
preferably, the step in which the gesture detection and separation module identifies and separates gestures to obtain the gesture image set specifically includes:
Skin detection, namely identifying pixels conforming to a preset skin color threshold range in an image in an HSV color space, and carrying out morphological operation on the identified skin pixels to optimize the boundary of a skin region;
Gesture segmentation, namely separating gestures from the background through an image segmentation algorithm;
The image segmentation algorithm includes threshold segmentation, edge detection, and morphological operations.
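A minimal sketch of the skin detection and gesture segmentation steps is given below; the HSV skin color threshold range is a commonly used illustrative value, not a range specified by the invention:

```python
import cv2
import numpy as np

SKIN_LOWER = np.array([0, 40, 60], dtype=np.uint8)     # assumed lower HSV bound
SKIN_UPPER = np.array([25, 255, 255], dtype=np.uint8)  # assumed upper HSV bound

def segment_gesture(hsv_frame):
    # Skin detection: keep pixels inside the preset skin color threshold range
    mask = cv2.inRange(hsv_frame, SKIN_LOWER, SKIN_UPPER)
    # Morphological operations to optimize the skin region boundary
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Gesture segmentation: separate the gesture from the background
    return cv2.bitwise_and(hsv_frame, hsv_frame, mask=mask)
```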
The feature extraction and identification module is used for receiving the gesture image set, identifying and extracting a gesture feature set, and extracting key points of the detected gesture according to the extracted gesture feature set to obtain a geometric feature image set;
preferably, the step of identifying and extracting the gesture feature set by the feature extraction and identification module specifically includes:
Extracting the gesture feature set, namely extracting features of the gesture image set by using a deep learning model to generate a feature map containing human body structure information;
In particular, the deep learning model includes a Convolutional Neural Network (CNN);
the Convolutional Neural Network (CNN) feature extraction process is as follows:
The preprocessed gesture image set is input into a Convolutional Neural Network (CNN) model;
The image data is passed layer by layer through convolution layers, activation functions and pooling layers, undergoing convolution, activation and pooling in turn, and is converted into a feature map containing key gesture information;
After repeated convolution and pooling, the feature map is input into a fully connected layer, which maps it into the sample label space for classification or regression, yielding a feature map containing human body structure information such as edges, corner points and textures.
Preferably, the Convolutional Neural Network (CNN) model includes:
At least one convolution layer for extracting local features from an input image;
an activation function, connected after each convolution layer;
At least one pooling layer for reducing the dimension of the feature map and preserving important features;
At least one fully connected layer, located after the convolution layer and the pooling layer, for mapping the learned feature representation to a sample label space;
Wherein the convolution layer comprises a plurality of convolution kernels, each convolution kernel performing a convolution operation on the input image through a sliding window;
The activation function includes a ReLU function;
the pooling layer comprises a maximum pooling operation;
and the fully connected layer flattens the feature map into a one-dimensional vector and outputs the image classification result.
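A minimal PyTorch sketch of a CNN with this structure (convolution, ReLU activation and max pooling repeated, followed by a fully connected layer) is shown below; the layer sizes, the 64x64 input resolution and the number of gesture classes are assumptions:

```python
import torch.nn as nn

class GestureCNN(nn.Module):
    def __init__(self, num_classes=10):  # assumed number of gesture types
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer
            nn.ReLU(),                                   # activation function
            nn.MaxPool2d(2),                             # max pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),  # flatten the feature map into a one-dimensional vector
            nn.Linear(32 * 16 * 16, num_classes),  # assumes 64x64 input images
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```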
Before actual application, the Convolutional Neural Network (CNN) model constructed as described above is further prepared as follows:
Collecting a large amount of image or video data containing various gestures, which data covers a wide range of scenes, lighting conditions, and gesture types;
cleaning the collected data, removing noise, blurring or repeated images, and ensuring the data quality;
Labeling the gesture image, and determining the position of a key point of each gesture;
The data volume is increased through methods such as rotation, scaling and cropping, improving the generalization capability of the model;
Training the Convolutional Neural Network (CNN) model by the collected large amount of data, and improving the robustness of the model through the diversity of training data;
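A minimal sketch of the augmentation and training step follows, reusing the GestureCNN sketch above; the augmentation parameters, the folder-per-class dataset layout and the training hyperparameters are all illustrative assumptions:

```python
import torch
from torchvision import transforms, datasets

augment = transforms.Compose([
    transforms.RandomRotation(15),                       # rotation
    transforms.RandomResizedCrop(64, scale=(0.8, 1.0)),  # scaling and cropping
    transforms.ToTensor(),
])

# assumed folder-per-class layout of labeled gesture images
dataset = datasets.ImageFolder("gesture_data/train", transform=augment)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

model = GestureCNN(num_classes=len(dataset.classes))  # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for images, labels in loader:  # one training pass over the data
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```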
preferably, the step in which the feature extraction and identification module extracts key points of the detected gesture to obtain the geometric feature image set specifically includes:
Extracting a geometric feature image set, namely detecting key points of gestures by using a deep learning model based on the extracted feature image, and generating the geometric feature image set containing key point information;
specifically, the deep learning model includes the OpenPose model;
the OpenPose model extraction process is as follows:
The preprocessed gesture feature set is input into the OpenPose model;
Predicting the positions of key points of a human body in an image through a feature map and a confidence map for an input gesture feature set, analyzing the relation between the key points through a part affinity field, and detecting the connection and arrangement of all parts and limbs of the human body;
a set of geometric feature images containing keypoint information is generated;
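A minimal keypoint-extraction sketch using OpenCV's DNN module to run a pretrained OpenPose (COCO, 18-keypoint) model is given below. The model file names are assumptions, and for brevity the sketch takes a single peak per confidence map (single person); the part-affinity-field grouping used for multiple people is omitted:

```python
import cv2

# assumed OpenPose COCO model files
net = cv2.dnn.readNetFromCaffe("pose_deploy.prototxt", "pose_iter_440000.caffemodel")

def extract_keypoints(frame, conf_threshold=0.1):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1.0 / 255, (368, 368), (0, 0, 0))
    net.setInput(blob)
    output = net.forward()  # confidence maps for body keypoints (plus PAFs)
    points = []
    for i in range(18):  # the 18 COCO body keypoints
        heatmap = output[0, i, :, :]
        _, conf, _, peak = cv2.minMaxLoc(heatmap)
        # rescale the peak location back to the original frame size
        x = int(w * peak[0] / heatmap.shape[1])
        y = int(h * peak[1] / heatmap.shape[0])
        points.append((x, y) if conf > conf_threshold else None)
    return points
```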
The gesture analysis module is used for matching the extracted geometric feature image set with a predefined action and gesture mode and identifying and analyzing gesture types;
Preferably, the step in which the gesture analysis module identifies and analyzes gesture types specifically includes:
matching the key points to corresponding gesture structures to form complete gesture postures, and recognizing the gestures as scoring or violation gestures;
Matching the referee/players to front-facing or back-facing orientations through facial features in the gesture image;
the left hand position and the right hand position of the referee gesture are matched and distinguished through the relative position characteristics of the hand and the body in the gesture image, and a left hand judgment result and a right hand judgment result are output;
and determining the number of fingers shown by the referee through matching, according to the relative positions of the finger joints in the gesture image.
Specifically, according to the predicted key point positions and the part affinity fields (PAFs), key points are matched to the corresponding gesture structures to form complete gesture postures; the referee's orientation (front or back) is calculated from the face position, so that the left and right hands can be distinguished and the competing team identified;
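The following purely geometric sketch illustrates two of the matching rules above: counting extended fingers and deciding left versus right hand. The keypoint indices follow the 21-point OpenPose hand convention, and the decision rules are illustrative assumptions rather than the invention's exact matching logic:

```python
def count_extended_fingers(hand_keypoints):
    """hand_keypoints: 21 (x, y) hand keypoints in OpenPose order, or None."""
    # fingertip / middle-joint index pairs for the four fingers (thumb omitted)
    finger_pairs = [(8, 6), (12, 10), (16, 14), (20, 18)]
    count = 0
    for tip, pip in finger_pairs:
        if hand_keypoints[tip] is not None and hand_keypoints[pip] is not None:
            # treat a finger as extended when its tip lies above its middle
            # joint in image coordinates (smaller y value)
            if hand_keypoints[tip][1] < hand_keypoints[pip][1]:
                count += 1
    return count

def is_left_hand(wrist, neck, facing_camera):
    """Classify a raised hand as the referee's left or right hand from the
    wrist position relative to the body midline (the neck keypoint)."""
    on_image_left = wrist[0] < neck[0]
    # a referee facing the camera has his left hand on the image's right side;
    # facing away from the camera, the sides coincide
    return not on_image_left if facing_camera else on_image_left
```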
The statistics unit is coupled with the processing unit, calculates scores according to gesture recognition results, and records and updates score or punishment information of the scoring party and the offender;
The system also comprises a database for storing and managing the competition data, and ensuring the accuracy, the integrity and the traceability of the data;
The system further comprises:
the video storage and processing module, coupled to the AI intelligent recognition and statistics module, is used for automatically saving video stream segments and generating a playback video of duration X when a scoring or violation gesture type is recognized;
The generated playback video adopts an efficient video compression algorithm and a storage technology, so that the definition and storage efficiency of video data are ensured, and the playback video is stored in a local storage or cloud server;
Specifically, the automatic saving of video segments preserves the highlight moment video of duration X counted backward from the moment the referee's action is recognized.
The duration X can be configured in the system: different events require different review durations, as some scoring actions take longer and some shorter. X is also the buffered recording duration of the AI system; through this buffer, video of that duration can be viewed both in real time and afterwards;
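A minimal sketch of this buffered recording mechanism follows: a rolling frame buffer of duration X that is written out as a playback clip when a scoring or violation gesture type is recognized. The frame rate, codec and file name are assumptions:

```python
import collections
import cv2

FPS = 30
X_SECONDS = 10  # the system-configurable review duration X
buffer = collections.deque(maxlen=FPS * X_SECONDS)

def on_new_frame(frame):
    buffer.append(frame)  # keep only the most recent X seconds of frames

def save_playback(path="playback.mp4"):
    """Called when a scoring or violation gesture type is recognized."""
    h, w = buffer[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), FPS, (w, h))
    for frame in list(buffer):
        writer.write(frame)
    writer.release()
```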
The video synthesis unit synthesizes the live picture and the generated playback video into a single superimposed video picture;
The video output and display module is used for outputting video pictures to display equipment for display;
outputting the multipath video signals to display equipment through a video output and display module;
specifically, the output includes the on-site real-time picture, the scoring or violation action picture, and the superimposed video picture; multiple output formats and resolutions are supported to meet video display requirements in different scenarios;
And the display equipment includes venue large screens, televisions, mobile phones and other electronic devices capable of receiving the live broadcast picture.
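As an illustration of the synthesis and superposition of the playback video onto the live picture, the following sketch composites the playback frame as a picture-in-picture inset; the inset scale and corner position are assumptions:

```python
import cv2

def overlay_playback(live_frame, playback_frame, scale=0.3, margin=20):
    H, W = live_frame.shape[:2]
    w, h = int(W * scale), int(H * scale)
    inset = cv2.resize(playback_frame, (w, h))
    # place the playback inset in the top-right corner of the live picture
    out = live_frame.copy()
    out[margin:margin + h, W - margin - w:W - margin] = inset
    return out
```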
The device fundamentally changes the way review videos are generated for traditional sports events: a process that previously required manual participation is now fully automated. This not only reduces the complexity and error of manual operation, but also makes real-time recording of scoring actions possible. When an objection arises during a competition, the referee can quickly review the relevant video and make a fair and accurate judgment.
The working principle and the using flow of the invention are as follows:
The video stream acquired by the acquisition module is received in real time and subjected to real-time image processing to obtain an image set. A gesture image set is then obtained through operations such as action recognition, skin detection and gesture segmentation. A deep learning model performs referee gesture recognition on the gesture image set: when a gesture is recognized, it is matched against predefined action and gesture patterns to identify gesture types representing scores or violations; the specific score value or violation type is calculated by recognizing the number of fingers, and the left/right hand attribute of the gesture is judged to determine the scoring or offending party. This triggers the video storage and processing module to generate and save a playback video of duration X, which is superimposed with the live picture into a composite picture and output to the display device for display;
specifically, the scoring or violation action pictures, the real-time pictures, the score/violation superimposed pictures and the original on-site pictures are all transmitted to the broadcast director console and then pushed to the live streaming system;
And the score, the scoring or violation action pictures, the real-time pictures and the score/violation superimposed pictures are displayed on the large screen at the venue.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.