Technical Field

The invention relates to a recognition method, and in particular to a cross-modal sound source localization and gesture recognition technique based on joint audio-visual information, belonging to the field of computer vision.
Background

Gesture recognition is one of the representative tasks of computer vision. Accurate perception and recognition of human gestures is an important prerequisite for intelligent interaction and human-robot collaboration, and has attracted broad research attention in recent years; the study of body-language interaction is of great significance to application areas such as behavior analysis, intelligent driving, and medical control. However, in large-scale outdoor environments that are highly dynamic, strongly adversarial, and complex, such as battlefields, current methods struggle to accurately locate the command initiator and recognize the initiator's gestures so that the corresponding command can be executed.
To cope with gesture recognition in complex environments, traditional methods use visual information only. Some build long-term associations with recurrent neural networks, using a global context memory unit to attend to the informative nodes in each frame and thereby obtain richer behavioral features. Other methods use attention mechanisms to aggregate features of spatio-temporal image regions, effectively suppressing noise and improving recognition accuracy. However, these methods still cannot quickly and effectively locate the key region (the gesture of the command initiator) in a complex environment, which remains a major challenge for gesture recognition in large-scale outdoor scenes. The joint audio-visual cross-modal sound source localization and gesture recognition approach uses multi-modal data: audio information is used to precisely locate the command initiator, resolving the problem that an intelligent robot cannot recognize valid gestures against the cluttered background of a complex scene, and visual information is then used to recognize the initiator's gestures, thereby improving recognition accuracy in highly dynamic, strongly adversarial outdoor scenes. In addition, traditional sound source localization methods rely mainly on signal processing and mathematical algorithms such as beamforming, cross-correlation, and least squares; these methods can localize sound sources to a certain extent, but their accuracy and robustness are limited by environmental noise and other factors.
To address these problems, the present patent application discloses a joint audio-visual cross-modal sound source localization and gesture recognition method that combines sound source information with visual information. First, a convolutional neural network extracts features from the sound signal, improving the accuracy and robustness of sound source localization to a certain extent and precisely locating the command initiator. Then, taking spatial and temporal correlation into account, the initiator's gesture is recognized from hand joint information by a spatio-temporal graph convolutional network. The method thus realizes gesture recognition in large-scale, complex outdoor scenes, improves recognition accuracy in uncertain and highly dynamic environments, and enables an intelligent robot to carry out the corresponding command.
Summary of the Invention

To solve the problem that a machine in a complex scene can hardly determine the commander's position and therefore cannot accurately recognize the commander's gesture commands, the present invention combines audio-visual information to achieve precise recognition and tracking of gesture commands: the command initiator's position is located from sound, and the initiator's posture is recognized from visual information. The method adapts to the requirements of different environments and scenes, and offers high efficiency, reliability, and robustness.
To achieve precise recognition of gesture commands in complex environments, the technical solution adopted by the invention is a joint audio-visual cross-modal sound source localization and gesture recognition method: the position of the command initiator is precisely located from audio information, the initiator's gesture is then recognized from visual information, and the corresponding command is executed. The method is implemented as follows:
Step 1. Establish a spatial polar coordinate system.

With the intelligent robot at its center, a spatial polar coordinate system (r, φ, θ) is established to express the position of the sound source relative to the robot. The initial position of the robot center is (0, 0, 0), the origin of the coordinate system; r denotes the distance from the sound source to the robot center, φ denotes the azimuth angle of the sound source with respect to the robot center, and θ denotes the pitch angle of the sound source with respect to the robot center.
Step 2. Divide the three-dimensional space into subspaces.

Within the region of polar radius R, the three-dimensional space is divided into Z equally sized, mutually disjoint subspaces; that is, the subspaces are independent of one another and each subspace has a unique three-dimensional coordinate representation.
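As an illustration of this partition, the following Python sketch maps a polar position (r, φ, θ) to a unique subspace index. The bin counts along each coordinate and the maximum radius R_MAX are assumptions chosen for the example, not values fixed by the invention.

```python
import numpy as np

# Illustrative bin counts; Z = N_R * N_PHI * N_THETA subspaces in total.
N_R, N_PHI, N_THETA = 4, 36, 9
R_MAX = 20.0                      # assumed polar radius R of the covered region (metres)

def subspace_index(r, phi, theta):
    """Map a polar position (r, phi, theta) to a unique subspace index in [0, Z)."""
    i_r = min(int(r / R_MAX * N_R), N_R - 1)                                # radial bin
    i_phi = int((phi % (2 * np.pi)) / (2 * np.pi) * N_PHI)                  # azimuth bin
    i_theta = min(int((theta + np.pi / 2) / np.pi * N_THETA), N_THETA - 1)  # pitch bin
    return (i_r * N_PHI + i_phi) * N_THETA + i_theta

Z = N_R * N_PHI * N_THETA
print(subspace_index(5.0, np.pi / 4, 0.1), "of", Z, "subspaces")
```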
Step 3. Audio information preprocessing.

The intelligent robot is equipped with multiple microphones placed on the same horizontal plane. After the multi-microphone recordings are received, the audio information is preprocessed: the time-domain signal is converted into a frequency-domain signal, features are extracted from the frequency-domain signal, and the extracted features are converted into a form suitable for processing by a convolutional neural network.
Step 4. Convolutional neural network model.

A convolutional neural network model is trained on the preprocessed sound signal. The model consists of 4 convolutional layers and 4 pooling layers. The convolutional layers perform convolution operations between the convolution kernels and different local patches of the input to extract features; the pooling layers compress the features extracted by the convolutional layers and reduce the feature dimensionality.
The format-converted three-dimensional data is used as the input of the convolutional neural network, and the network output is used as the training feature, denoted f'. Classification is then performed on this feature to determine the subspace in which the sound source lies, realizing sound source localization:

g_z = classify(f')

where classify(·) denotes the classifier function and g_z denotes the predicted subspace containing the target sound source. According to the subspace position, the intelligent robot moves to the command initiator: the robot rotates by the azimuth angle φ and moves a distance of r·cosθ.
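The following PyTorch sketch illustrates one way to realize the 4-convolution / 4-pooling localization network and the Z-way classification described above. The channel widths, kernel sizes, and input resolution are assumptions made for the example; only the layer count and the subspace classification head follow the text.

```python
import torch
import torch.nn as nn

class SoundSourceCNN(nn.Module):
    """4 conv + 4 pool layers followed by a Z-way subspace classifier."""
    def __init__(self, num_subspaces: int, in_channels: int = 1):
        super().__init__()
        layers, ch = [], in_channels
        for out_ch in (16, 32, 64, 128):          # assumed channel widths
            layers += [nn.Conv2d(ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            ch = out_ch
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(ch, num_subspaces))

    def forward(self, x):                          # x: (batch, channel, freq, time)
        f_prime = self.features(x)                 # training feature f'
        return self.classifier(f_prime)            # logits over the Z subspaces

model = SoundSourceCNN(num_subspaces=1296)
logits = model(torch.randn(8, 1, 64, 64))          # dummy spectrogram batch
g_z = logits.argmax(dim=1)                         # g_z = classify(f')
```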
Step 5. Human detection with YOLOv7.

After obtaining the command initiator's position from the audio information, the intelligent robot moves to the initiator and uses YOLOv7 to process the visual information and detect human targets.
Step 6. Extract gesture skeleton information with AlphaPose.

After the human target is detected by YOLOv7, AlphaPose extracts the hand joint points of the body, yielding the joint point information of the command initiator's hand. This joint point information includes the positions and poses of key points such as fingers, wrists, elbows, and shoulders, and provides important features for the subsequent gesture recognition.
Step 7. Gesture recognition with a spatio-temporal graph convolutional network.

A spatio-temporal graph convolutional network models and processes the gesture joint point information extracted by AlphaPose to identify the gesture category. The network uses multiple graph convolutional layers and performs convolutions in both space and time to extract the spatial and temporal features of the gesture, giving high accuracy and robustness in gesture recognition.
Further, in step 2, three-dimensional sound source localization is characterized by a probability distribution, turning a linear regression problem into a nonlinear classification problem. Position features are extracted from the sound source signals received by the array, and classifiers decide which subspace the sound source belongs to.
Further, step 3 is implemented as follows:
(1) Pre-emphasis

The original audio signal is high-pass filtered to boost the energy of high-frequency components and attenuate low-frequency components. Pre-emphasis is expressed as:

s'(n) = s(n) − α·s(n−1)

where n is the sample index of the time-domain signal, s(n) and s(n−1) are the input samples at two adjacent sampling instants, s'(n) is the pre-emphasized signal, and α is the pre-emphasis coefficient.
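A one-line NumPy sketch of this filter; the value α = 0.97 is a common choice but is not fixed by the text.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """s'(n) = s(n) - alpha * s(n-1); the first sample is kept unchanged."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```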
(2) Framing

The audio signal is split into frames, each covering a fixed time window:

s_m(n) = s'(n)·w(n − mR)

where s_m(n) is the m-th frame, w(n) is the window function (a Hamming window is used), and R is the frame shift, set to half the frame length.
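A NumPy sketch of the framing step, with the frame shift R fixed to half the frame length as stated above; the frame length itself (512 samples) is an assumption.

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int = 512) -> np.ndarray:
    """Split a pre-emphasized signal into half-overlapping Hamming-windowed frames."""
    shift = frame_len // 2                          # R = frame length / 2
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    return np.stack([signal[m * shift: m * shift + frame_len] * window
                     for m in range(n_frames)])     # shape: (n_frames, frame_len)
```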
(3) STFT conversion

The short-time Fourier transform (STFT) converts the time-domain signal into a frequency-domain signal. Specifically, the time-domain signal is segmented, the length of each segment equals the window length, and adjacent segments overlap; a Fourier transform of each segment yields its frequency-domain representation, i.e. the time-domain signal is decomposed into a number of frequency bands:

S_m(k) = Σ_{n=0}^{N−1} s_m(n)·e^(−j2πkn/N)

where S_m(k) is the frequency-domain representation of the m-th frame, N is the number of samples per frame, and k = n − mR. Here j denotes the imaginary unit and e the natural constant.
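The per-frame DFT below corresponds to the STFT formula above; the sample rate is an assumed parameter used only to report the frequency axis.

```python
import numpy as np

def stft(frames: np.ndarray, sample_rate: int = 16000):
    """Apply a DFT to every windowed frame: S_m(k) = sum_n s_m(n) * exp(-j*2*pi*k*n/N)."""
    spectrum = np.fft.rfft(frames, axis=-1)                        # (n_frames, N//2 + 1)
    freqs = np.fft.rfftfreq(frames.shape[-1], d=1.0 / sample_rate)
    return spectrum, freqs
```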
(4) Feature extraction

Feature extraction obtains representative features from the frequency-domain signal and provides reliable input data for the subsequent sound source localization task. The MFCC method is used to extract sound features at different frequencies.

First, the frequency-domain signal is passed through a Mel filter bank to obtain coefficients at different frequencies:

E_m = Σ_k H_m(k)·|S(k)|²

where H_m(k) is the m-th (triangular) filter of the Mel filter bank, h(m) is the frequency corresponding to the m-th Mel frequency, and E_m is the Mel frequency coefficient of the m-th frame.
A discrete cosine transform (DCT) is then applied to the filtered Mel frequency coefficients: the DCT of the logarithmic Mel frequency coefficients yields a set of MFCC coefficients, which represent the speech characteristics of the audio signal:

f_a = Σ_{m=1}^{M} log(E_m)·cos(π·a·(m − 0.5)/M),  a = 1, 2, …, M

where f_a is the a-th MFCC coefficient and M is the number of MFCC coefficients. The MFCC coefficients are used as the feature representation for the sound source localization task.
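A compact NumPy/SciPy sketch of the two MFCC stages above (Mel filter-bank energies followed by a DCT of their logarithms). The filter count and coefficient count here are placeholders; the embodiment later specifies 40 filters and 13 coefficients.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular Mel filters H_m(k) spanning 0 .. sample_rate/2."""
    mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sample_rate).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        H[m - 1, bins[m - 1]:bins[m]] = np.linspace(0, 1, bins[m] - bins[m - 1], endpoint=False)
        H[m - 1, bins[m]:bins[m + 1]] = np.linspace(1, 0, bins[m + 1] - bins[m], endpoint=False)
    return H

def mfcc(power_spectrum: np.ndarray, sample_rate=16000, n_filters=26, n_coeffs=13):
    """power_spectrum: |S_m(k)|^2 per frame -> MFCC coefficients f_a per frame."""
    n_fft = (power_spectrum.shape[-1] - 1) * 2
    E = power_spectrum @ mel_filterbank(n_filters, n_fft, sample_rate).T   # Mel energies E_m
    return dct(np.log(E + 1e-10), type=2, axis=-1, norm="ortho")[..., :n_coeffs]
```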
(5) Format conversion

The extracted feature data is converted into a form suitable for convolutional neural network processing; that is, the frequency-domain signal is converted into image form so that the data can be treated as two-dimensional image data. Spectral analysis is performed on the signal, the magnitude spectrum or phase spectrum is extracted in the frequency domain, and a two-dimensional fast Fourier transform (FFT) converts it into image form. The magnitude spectrum Q_m(k) and the phase spectrum P_m(k) are computed as:

Q_m(k) = |S_m(k)|

P_m(k) = arg{S_m(k)}

where |·| denotes the modulus and arg{·} denotes the argument (phase angle). The magnitude or phase spectrum is regarded as two-dimensional image data; stacking it over successive time steps yields a three-dimensional data set whose first dimension is time, second dimension is frequency, and third dimension is magnitude or phase, which serves as the input data of the convolutional neural network.
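A small helper illustrating this stacking; whether the magnitude or the phase plane is used is a choice left open by the text.

```python
import numpy as np

def spectrogram_tensor(spectrum: np.ndarray, use_phase: bool = False) -> np.ndarray:
    """Stack per-frame spectra into a (time, frequency, 1) array for the CNN input."""
    plane = np.angle(spectrum) if use_phase else np.abs(spectrum)   # P_m(k) or Q_m(k)
    return plane[..., np.newaxis]
```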
Further, step 5 is implemented as follows:

(1) Data input

Video data is acquired as input and split into T frames. The image frames are fed into the YOLOv7 network, and multiple convolution and pooling operations are applied to produce a series of feature maps.
(2) Anchor boxes and feature-map processing

YOLOv7 uses anchor boxes to predict the position and size of targets. The feature maps obtained in the data input stage are processed by further convolutional and pooling layers to obtain feature maps at different scales.
(3) Target prediction

In the feature map, each pixel corresponds to a set of anchor boxes. By classifying and regressing each anchor box, the network predicts whether a target object is present in the box and estimates its position and size. The objective function is:

L = L_cls + λ_coord·L_coord + λ_obj·L_obj + λ_noobj·L_noobj

where L_cls is the classification loss, L_coord is the localization loss, L_obj is the loss for boxes containing a target, and L_noobj is the loss for boxes containing no target; λ_coord, λ_obj, and λ_noobj are weighting parameters. Specifically, the classification loss is computed with a cross-entropy loss and the localization loss with a mean squared error loss; the losses for boxes with and without a target are respectively:

L_coord = Σ_{i=1}^{G×G} Σ_{j=1}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]

L_obj = Σ_{i=1}^{G×G} Σ_{j=1}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²

L_noobj = Σ_{i=1}^{G×G} Σ_{j=1}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²

where G is the size of the feature map, B is the number of anchor boxes predicted at each pixel, 1_{ij}^{obj} indicates whether the j-th anchor box at the i-th pixel contains a target object, and 1_{ij}^{noobj} indicates whether it does not. x_i, y_i are the center coordinates of the predicted box and x̂_i, ŷ_i are the center coordinates of the ground-truth object; C_i is the predicted confidence that a target is present in the box and Ĉ_i is the ground-truth confidence.
Redundant predicted boxes are removed by non-maximum suppression (NMS). The prediction results are denoted γ = {b_1, b_2, …, b_T}, where b_t = (x_t, y_t, w_t, h_t) gives the center coordinates (x_t, y_t) and the width and height (w_t, h_t) of the command initiator in the t-th frame, thereby determining the initiator's position in every frame.
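For reference, a plain NumPy sketch of the NMS step named above, assuming boxes given as (x, y, w, h) center-size tuples with confidence scores; YOLOv7 implementations ship their own post-processing, so this is only illustrative.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5) -> list:
    """Suppress overlapping (x, y, w, h) boxes, keeping the highest-scoring ones."""
    x1, y1 = boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2
    x2, y2 = boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2
    areas = (x2 - x1) * (y2 - y1)
    order, keep = scores.argsort()[::-1], []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1, yy1 = np.maximum(x1[i], x1[order[1:]]), np.maximum(y1[i], y1[order[1:]])
        xx2, yy2 = np.minimum(x2[i], x2[order[1:]]), np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]
    return keep
```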
Further, in step 6, the command initiator is cropped from every frame and AlphaPose performs hand joint detection on the crop. After AlphaPose, the joint point information of the command initiator's hand in the t-th frame is denoted U_t = {u_t^1, u_t^2, …, u_t^D}, where u_t^d denotes the coordinates of the d-th hand joint of the command initiator in the t-th frame. Concatenating the hand joint information of every frame yields the joint point information of the entire gesture, denoted U = {U_1, U_2, …, U_T}.
Further, in step 7, after the spatio-temporal graph convolutional network receives the joint point information of the command initiator's hand, the joint information U_t of each frame is treated as a graph: each joint is a graph node, and edges between nodes represent the connections between joints. The coordinates and time information of every node are concatenated into a three-dimensional tensor X ∈ R^(T×D×3), where T is the number of time steps, D is the number of key points, and 3 is the feature dimension of each node. A spatio-temporal graph convolution is applied to this tensor, producing a new three-dimensional tensor X' ∈ R^(T×D×F), where F is the number of feature channels. This tensor is regarded as the feature representation of the gesture, and global pooling or a convolutional neural network is used for the classification or regression task. A weighted sum of the feature representations of the initiator's gesture over all frames gives an overall gesture representation, which a softmax function converts into a probability distribution, i.e. the probability that the gesture belongs to each category, yielding the final gesture recognition result.
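The PyTorch sketch below shows the shape of such a pipeline: a per-joint feature lift, one spatial graph convolution over an adjacency matrix A, one temporal convolution, global pooling, and a softmax head. It is a minimal stand-in for a full spatio-temporal graph convolutional network (the embodiment describes 10 layers with residual connections); the layer sizes, adjacency, and joint count are assumptions.

```python
import torch
import torch.nn as nn

class MiniSTGCN(nn.Module):
    """One spatial graph convolution + one temporal convolution + softmax head."""
    def __init__(self, adjacency: torch.Tensor, num_classes: int, feat: int = 64):
        super().__init__()
        self.register_buffer("A", adjacency)            # (D, D) hand-joint graph
        self.lift = nn.Linear(3, feat)                   # per-joint feature lift
        self.temporal = nn.Conv1d(feat, feat, kernel_size=9, padding=4)
        self.head = nn.Linear(feat, num_classes)

    def forward(self, x):                                # x: (batch, T, D, 3)
        x = self.lift(x)                                 # (B, T, D, F)
        x = torch.einsum("uv,btvf->btuf", self.A, x)     # spatial graph convolution
        B, T, D, F = x.shape
        x = self.temporal(x.permute(0, 2, 3, 1).reshape(B * D, F, T))  # temporal conv
        x = x.reshape(B, D, F, T).mean(dim=(1, 3))       # global pooling over joints and time
        return torch.softmax(self.head(x), dim=-1)       # class probabilities

D, T = 21, 30                                            # assumed joint count and frame count
probs = MiniSTGCN(torch.eye(D), num_classes=10)(torch.randn(2, T, D, 3))
```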
In complex, large-scale outdoor scenes, noise and other factors make it difficult for existing gesture recognition techniques to precisely locate and recognize the gesture commands of the command initiator. To solve this problem, the invention uses audio information for sound source localization and visual information for gesture recognition; through this joint audio-visual approach, the gesture of the command initiator can be precisely located and recognized in complex scenes.
Brief Description of the Drawings

With reference to the accompanying drawings and through the description of the embodiments, the specific implementation of the invention, such as the network structures involved, is described in further detail below, to help those skilled in the art gain a more complete, accurate, and thorough understanding of the inventive concept and technical solution of the invention.

The invention is described in detail below with reference to the drawings and embodiments.
Fig. 1 is a flow diagram of the joint audio-visual cross-modal sound source localization and gesture recognition method of the invention;

Fig. 2 is a flowchart of the audio information preprocessing of the invention;

Fig. 3 shows the structural parameters of the convolutional network model used for sound source localization in the invention;

Fig. 4 is a structural diagram of the spatio-temporal graph convolutional layer of the invention.
Detailed Description

The invention is described in detail below with reference to the drawings and embodiments.

The technical solution adopted by the invention is a joint audio-visual cross-modal sound source localization and gesture recognition method: the position of the command initiator is first located precisely from audio information, the initiator's gesture is then recognized from visual information, and the corresponding command is executed.
S1. Establish a spatial polar coordinate system.

With the intelligent robot at its center, a spatial polar coordinate system (r, φ, θ) is established to express the position of the sound source relative to the robot. The initial position of the robot center is (0, 0, 0), the origin of the coordinate system; r denotes the distance from the sound source to the robot center, φ denotes the azimuth angle of the sound source with respect to the robot center, and θ denotes the pitch angle of the sound source with respect to the robot center.
S2. Divide the three-dimensional space into subspaces.

Within the region of polar radius R, the three-dimensional space is divided into Z equally sized, mutually disjoint subspaces; that is, the subspaces are independent of one another and each has a unique three-dimensional coordinate representation. The smaller the subspaces, the larger their number: a larger Z increases the classification complexity but also improves localization accuracy. Provided the subspaces are small enough, three-dimensional sound source localization can be handled through a probability distribution, turning the linear regression problem into a nonlinear classification problem and reducing the computational cost. By extracting position features from the sound source signals received by the array, classifiers can decide which subspace the sound source belongs to.
S3. Audio information preprocessing.

In the invention, the intelligent robot is equipped with multiple microphones, and the microphone layout must be determined before audio is collected; in general, the arrangement of the microphone array affects the accuracy of sound source localization. The microphones are placed on the same horizontal plane with a spacing of 10 cm between adjacent microphones, to ensure the quality and precision of the captured sound signal.

After the multi-microphone recordings are received, the audio information is preprocessed: first the time-domain signal is converted into a frequency-domain signal, then features are extracted from the frequency-domain signal, and finally the features are converted into a form suitable for convolutional neural network processing. The main procedure is as follows:
(1) Pre-emphasis

The original signal is high-pass filtered to boost the energy of high-frequency components and attenuate low-frequency components. Pre-emphasis is expressed as:

s'(n) = s(n) − α·s(n−1)

where s(n) is the input signal, s'(n) is the pre-emphasized signal, and α is the pre-emphasis coefficient.
(2) Framing

The signal is split into frames, each covering a fixed time window:

s_m(n) = s'(n)·w(n − mR)

where s_m(n) is the m-th frame, w(n) is the window function (a Hamming window is used), and R is the frame shift, set to half the frame length.
(3) STFT conversion

After pre-emphasis and framing, the sound signal is converted with the short-time Fourier transform (STFT), turning the time-domain signal into a frequency-domain signal. The STFT is an analysis method based on the Fourier transform: a long signal is divided into short segments, the signal within each segment is regarded as stationary, and a Fourier transform is applied. The STFT converts the time-domain signal into the frequency domain and extracts sound features at different frequencies, which is an important basis for the subsequent feature extraction and sound source localization.

Specifically, the time-domain signal is segmented, the length of each segment equals the window length, adjacent segments overlap, and a Fourier transform of each segment gives its frequency-domain representation, i.e. the signal is decomposed into a number of frequency bands:

S_m(k) = Σ_{n=0}^{N−1} s_m(n)·e^(−j2πkn/N)

where S_m(k) is the frequency-domain representation of the m-th frame, N is the number of samples per frame, and k = n − mR.
(4) Feature extraction

In sound source localization, feature extraction from the frequency-domain signal is a crucial step: it extracts representative features and provides reliable input data for the subsequent localization task. The invention uses the MFCC (Mel Frequency Cepstral Coefficients) method to extract sound features at different frequencies.

First, the frequency-domain signal is passed through a Mel filter bank to obtain coefficients at different frequencies:

E_m = Σ_k H_m(k)·|S(k)|²

where H_m(k) is the m-th (triangular) filter of the Mel filter bank, h(m) is the frequency corresponding to the m-th Mel frequency, and E_m is the Mel frequency coefficient of the m-th frame.
Next, a discrete cosine transform (DCT) is applied to the filtered Mel frequency coefficients; the DCT converts the Mel frequency coefficients into MFCC coefficients. Specifically, the DCT of the logarithmic Mel frequency coefficients yields a set of MFCC coefficients that represent the speech characteristics of the audio signal:

f_a = Σ_{m=1}^{M} log(E_m)·cos(π·a·(m − 0.5)/M),  a = 1, 2, …, M

where f_a is the a-th MFCC coefficient and M is the number of MFCC coefficients. The invention uses the MFCC coefficients as the feature representation for the sound source localization task.
(5) Format conversion

The extracted feature data is converted into a form suitable for convolutional neural network processing; that is, the frequency-domain signal is converted into image form so that the data can be treated as two-dimensional image data. The invention performs spectral analysis on the signal, extracts the magnitude spectrum or phase spectrum in the frequency domain, and then converts it into image form using a two-dimensional fast Fourier transform (FFT). The magnitude spectrum Q_m(k) and the phase spectrum P_m(k) are computed as:

Q_m(k) = |S_m(k)|

P_m(k) = arg{S_m(k)}

where |·| denotes the modulus and arg{·} denotes the argument (phase angle). The invention regards the magnitude or phase spectrum as two-dimensional image data; stacking it over successive time steps yields a three-dimensional data set whose first dimension is time, second dimension is frequency, and third dimension is magnitude or phase, which can then serve as input data for the convolutional neural network.
S4. Convolutional neural network model.

A convolutional neural network model is trained on the preprocessed sound signal. The model constructed in the invention consists of 4 convolutional layers and 4 pooling layers. The convolutional layers mainly perform convolutions between the convolution kernels and different local patches of the input to extract features; the pooling layers mainly compress the features extracted by the convolutional layers and reduce the feature dimensionality, thereby lowering the computational cost, preventing overfitting, and increasing computation speed.

The invention uses the format-converted three-dimensional data as the input of the convolutional neural network, and the network output as the training feature, denoted f'. Classification is then performed on this feature to determine the subspace in which the sound source lies, realizing sound source localization. This process is expressed as:

g_z = classify(f')

where classify(·) denotes the classifier function and g_z denotes the predicted subspace containing the target sound source. Finally, according to the subspace position, the intelligent robot moves to the command initiator: the robot rotates by the azimuth angle φ and moves a distance of r·cosθ.
S5. Human detection with YOLOv7.

YOLOv7 is an efficient object detection algorithm that can locate objects in images quickly and accurately. In the invention, YOLOv7 is used to detect human bodies in the video; it can process a large number of images in a short time and locate human bodies accurately. Through YOLOv7, the position information of the command initiator is obtained in preparation for the subsequent gesture recognition. The main procedure is as follows:
(1) Data input

Video data is acquired as input and split into T frames. The image frames are fed into the YOLOv7 network, and multiple convolution and pooling operations are applied to produce a series of feature maps.
(2) Anchor boxes and feature-map processing

To detect target objects of different sizes and aspect ratios, YOLOv7 uses anchor boxes to predict the position and size of targets. Anchor boxes are a set of boxes of fixed sizes and aspect ratios covering different regions of the input image. In addition, the feature maps obtained in the data input stage are processed by further convolutional and pooling layers to obtain feature maps at different scales.
(3) Target prediction

In the feature map, each pixel corresponds to a set of anchor boxes. By classifying and regressing each anchor box, the network predicts whether a target object is present in the box and estimates its position and size. The objective function is:

L = L_cls + λ_coord·L_coord + λ_obj·L_obj + λ_noobj·L_noobj

where L_cls is the classification loss, L_coord is the localization loss, L_obj is the loss for boxes containing a target, and L_noobj is the loss for boxes containing no target; λ_coord, λ_obj, and λ_noobj are weighting parameters. Specifically, the classification loss is computed with a cross-entropy loss and the localization loss with a mean squared error loss; the losses for boxes with and without a target are respectively:

L_coord = Σ_{i=1}^{G×G} Σ_{j=1}^{B} 1_{ij}^{obj} [(x_i − x̂_i)² + (y_i − ŷ_i)²]

L_obj = Σ_{i=1}^{G×G} Σ_{j=1}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²

L_noobj = Σ_{i=1}^{G×G} Σ_{j=1}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²

where G is the size of the feature map, B is the number of anchor boxes predicted at each pixel, 1_{ij}^{obj} indicates whether the j-th anchor box at the i-th pixel contains a target object, and 1_{ij}^{noobj} indicates whether it does not. x_i, y_i are the center coordinates of the predicted box and x̂_i, ŷ_i are the center coordinates of the ground-truth object; C_i is the predicted confidence that a target is present in the box and Ĉ_i is the ground-truth confidence.
In addition, several predicted boxes may cover the same target object. Redundant boxes are therefore removed by non-maximum suppression (NMS), keeping only the best predictions. The prediction results are denoted β = {b_1, b_2, …, b_T}, where b_t = (x_t, y_t, w_t, h_t) gives the center coordinates (x_t, y_t) and the width and height (w_t, h_t) of the command initiator in the t-th frame, thereby determining the initiator's position in every frame.
S6. Extract gesture skeleton information with AlphaPose.

AlphaPose is a human pose estimation algorithm that extracts human joint points quickly and accurately. In the invention, AlphaPose extracts the hand joint points of the body detected by YOLOv7, yielding the joint point information of the command initiator's hand. This joint point information includes the positions and poses of key points such as fingers, wrists, elbows, and shoulders, and provides important features for the subsequent gesture recognition.

The invention crops the command initiator from every frame and applies AlphaPose hand joint detection to the crop. After AlphaPose, the joint point information of the command initiator's hand in the t-th frame is denoted U_t = {u_t^1, u_t^2, …, u_t^D}, where u_t^d denotes the coordinates of the d-th hand joint of the command initiator in the t-th frame. The hand joint information of every frame is concatenated into the joint point information of the entire gesture, denoted U = {U_1, U_2, …, U_T}.
S7. Gesture recognition with a spatio-temporal graph convolutional network.

The spatio-temporal graph convolutional network is a video action recognition algorithm based on graph convolutional neural networks that makes effective use of temporal and spatial information to classify actions in video. In the invention, it models and processes the gesture joint point information extracted by AlphaPose to identify the gesture category. The network uses multiple graph convolutional layers and performs convolutions in both space and time, extracting the spatial and temporal features of the gesture; it offers high accuracy and robustness for gesture recognition.

Specifically, after the joint point information of the command initiator's hand is obtained, the joint information U_t of each frame is treated as a graph: each joint is a graph node, and edges between nodes represent the connections between joints. The invention concatenates the coordinates and time information of every node into a three-dimensional tensor X ∈ R^(T×D×3), where T is the number of time steps, D is the number of key points, and 3 is the feature dimension of each node (comprising the xy coordinates and the time information). A spatio-temporal graph convolution is applied to this tensor, producing a new three-dimensional tensor X' ∈ R^(T×D×F), where F is the number of feature channels. The invention regards this tensor as the feature representation of the gesture, and uses global pooling or a convolutional neural network for the classification or regression task.

Finally, a weighted sum of the feature representations of the initiator's gesture over all frames gives an overall gesture representation, which a softmax function converts into a probability distribution, i.e. the probability that the gesture belongs to each category, yielding the final gesture recognition result.
Embodiment 1: the overall flow is shown in Fig. 1.
Embodiment 2: the audio information preprocessing flow is shown in Fig. 2. In the invention, the audio information must be preprocessed before being fed into the convolutional neural network, converting it into a format suitable for the network input. The specific procedure is as follows:

Pre-emphasis: pre-emphasis is a high-pass filter that strengthens high-frequency components and attenuates low-frequency components, making the audio signal more stable in subsequent processing.
Framing: framing divides the audio signal into segments, each called a frame, for subsequent processing. The frame length and frame shift must be set; a frame length of 20-40 ms and a frame shift of 10-20 ms are typically chosen.
STFT conversion: a Fourier transform is applied to each frame to obtain its frequency-domain signal, usually computed with the short-time Fourier transform (STFT).
Feature extraction: the frequency-domain signal is filtered by a Mel filter bank, a set of filters that perform frequency-domain filtering and convert the frequency-domain signal into Mel frequency coefficients. In the invention, 40 Mel filters are used. A discrete cosine transform (DCT) is then applied to the filtered Mel frequency coefficients, converting them into MFCC coefficients. In the invention, 13 MFCC coefficients are used as the features of the audio information.
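With these concrete parameters, the feature extraction step can be reproduced, for example, with librosa as sketched below; the file name, sample rate, FFT size, and hop length are assumptions made for the illustration.

```python
import librosa

# 40 Mel filters and 13 MFCC coefficients, as specified in this embodiment.
y, sr = librosa.load("command.wav", sr=16000)            # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=40,
                            n_fft=512, hop_length=256)   # shape: (13, n_frames)
```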
Format conversion: the frequency-domain signal features are converted into image form so that the data can be treated as two-dimensional image data. A typical approach is to extract the magnitude spectrum or phase spectrum of the signal in the frequency domain and then convert it into image form with a two-dimensional fast Fourier transform (FFT).
Embodiment 3: the structural parameters of the convolutional network model used for sound source localization are shown in Fig. 3. In the invention, kernel_size denotes the convolution kernel size, stride the stride, pad the padding parameter, pooling the pooling method, dropout the neuron drop rate, iterations the number of training iterations, and batch_size the batch size.
Embodiment 4: the structure of the spatio-temporal graph convolutional layer is shown in Fig. 4. In the invention, the whole model is trained end to end by back-propagation. Specifically, the spatio-temporal graph convolution is divided into a spatial graph convolution and a temporal graph convolution; the spatial graph convolution is the core part, and the temporal graph convolution comprises two BN layers, a ReLU activation layer, a Dropout layer, and a convolutional layer. One spatial graph convolution plus one temporal convolution forms one layer; there are 10 layers in total, and the first layer has no residual connection.