Technical Field
The invention relates to the technical field of railway traffic safety detection, and in particular to a railway driver behavior recognition method based on CLSTA (Convolutional LSTM Networks with Spatial-Temporal Attention).
Background Art
China's railway construction is entering a period of rapid growth characterized by "leap-forward development", which places higher demands on locomotive operation safety assurance technology. Ensuring the smooth operation of locomotives has become the top priority of railway transportation departments, and improving the level at which railway maintenance departments monitor and manage locomotive operation safety has become an urgent task.
It is well known that, apart from sudden equipment failures such as broken axles on rolling stock or broken rails, or natural disasters, the greatest threats to train operation safety are whether the train operation signals are indicated correctly and whether the driver operates the locomotive correctly. Judging from the direct causes of past major accidents such as train collisions, rear-end collisions, and overturning caused by overspeed, incorrect signal indications and drivers losing vigilance and mishandling the train account for the largest share. Accident statistics for China's railway system show that a considerable portion of the human factors in train accidents arise from improper operation by crew members. Among these, improper driving behavior, such as fatigued driving, sleeping, rule violations, and bad driving habits, is one of the major causes of traffic safety accidents. Owing to heavy transportation tasks, a harsh working environment, and irregular work and rest schedules, train drivers remain in a state of high workload, high tension, and high-speed operation year round, and are thus prone to other improper operations while driving.
China's railway traffic safety monitoring has made great progress in recent years, but a considerable gap remains compared with developed countries, mainly in the poor accuracy and timeliness of monitored information, the lack of recognition of and alarms for the driver's personal working state, and system functions that fall short of requirements. Real-time monitoring and intelligent evaluation of locomotive drivers' driving behavior and driving state helps detect possible operating errors early and is of great practical significance for reducing safety accidents and casualties. Such a system can help the driver focus on driving the locomotive, evaluate the driver's behavior during driving, and raise an alarm when fatigued driving or abnormal operation occurs, so that the locomotive can be operated more safely. At the same time, the system can provide ground management departments with real-time monitoring of dynamic locomotive operation data, supervise the driver's working state in real time and record the whole process when an abnormality occurs, grasp the operating status of the entire locomotive under abnormal conditions in a timely manner, and improve the ability to supervise locomotive operation safety.
Summary of the Invention
In view of the above problems, the object of the present invention is to provide a CLSTA-based railway driver behavior recognition method that uses surveillance video from the driver's cab to recognize and understand the locomotive driver's behavior, monitoring and intelligently evaluating the driver's driving behavior and driving state in real time. The technical solution is as follows:
A CLSTA-based railway driver behavior recognition method, characterized by comprising the following steps:
Step 1: According to the environment in the driver's cab and the characteristics of drivers' common behaviors, build an improved spatio-temporal attention network STA and design its topology; the improved spatio-temporal attention network STA comprises a spatial attention sub-network SA and a temporal attention sub-network TA.
Step 2: Fuse the spatial attention sub-network SA and the temporal attention sub-network TA with the Main LSTM network to obtain a new CLSTA neural network model, and design its topology; the Main LSTM network consists of a Main CNN network cascaded with a two-layer LSTM network.
Step 3: Use video samples of locomotive drivers' common behaviors as the data set, input them into the CLSTA neural network model, and train the model; deploy the resulting model on an industrial computer to monitor and recognize locomotive driver behavior.
Further, the spatial attention sub-network SA extracts spatial features through a convolutional neural network CNN based on the AlexNet network; the AlexNet network here comprises five convolutional layers and one fully connected layer fc6, six learning layers in total. The spatial attention sub-network SA is a two-stream CNN structure, CNN1 and CNN2, used to extract the spatial features of the current picture stream; CNN1 and CNN2 each have six learning layers. CNN1 processes the current frame stream x_t: the current picture frame x_t is input into CNN1. CNN2 processes the previous frame x_{t-1}: the previous frame x_{t-1} is input into CNN2. A subtraction is then performed through an eltwise layer, subtracting the output features of CNN2 from the output features of CNN1, and the output of the eltwise layer is fed into a fully connected layer Fc_layer1.
Still further, the temporal attention sub-network TA is a two-stream CNN+LSTM structure, CNN1+LSTM1 and CNN2+LSTM2, used to extract the temporal features of the current picture stream. The current picture stream x_t is input into CNN1 for spatial feature learning, and the output of CNN1 is input into LSTM1 for temporal learning; the previous frame x_{t-1} is input into CNN2 for spatial feature learning, and the output of CNN2 is input into LSTM2 for temporal learning. A subtraction is then performed through an eltwise layer, subtracting the features output by LSTM2 from the features output by LSTM1, and the output of the eltwise layer is fed into a fully connected layer Fc_layer2.
Still further, the specific steps of step 2 include:
Step 21: Input the current picture stream x_t into the Main CNN to extract the spatial features of the current picture stream.
Step 22: Fuse the output of the spatial attention sub-network SA with the output of the Main CNN; the fusion is an addition performed through an eltwise layer.
Step 23: Input the features output by the fusion in step 22 into the Main LSTM network for temporal feature learning; the Main LSTM network is formed by cascading two LSTM layers, the input of LSTM1 being the output of step 22, and the features output by LSTM1 being input into LSTM2.
Step 24: Fuse the output of the temporal attention sub-network TA with the output of the Main LSTM network from step 23; the fusion is an addition performed through an eltwise layer. After fusion, connect Fc_layer3 and finally perform classification.
Still further, the specific steps of step 3 include:
Step 31: Capture video of the environment with an industrial camera.
Step 32: A script program on the industrial computer decomposes the video into picture frames at 5 FPS.
Step 33: Feed each run of 16 consecutive decomposed frames into the model for testing.
Step 34: Output the test results and generate a report.
The beneficial effects of the present invention are as follows: in view of the current situation, the present invention proposes an improved spatio-temporal attention method STA (Spatial-Temporal Attention) to solve this problem. A neural network model is obtained by training on a large data set, and the model is then deployed on an industrial computer to analyze locomotive drivers' common and abnormal behaviors during driving, such as "normal driving", "fatigue driving", "using a phone", and "smoking", ultimately achieving the goal of understanding locomotive driver behavior.
Brief Description of the Drawings
Fig. 1 is a schematic flow diagram of a CLSTA-based railway driver behavior recognition method.
Fig. 2 is a schematic diagram of the internal structure of an LSTM network unit.
Fig. 3 is a block diagram of the CLSTA network topology; Fc_layer denotes a fully connected layer and Relu an activation layer (not counted among the main learning layers).
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Cameras can collect spatially dense data and offer remote measurement at the expense of some precision; they are relatively cheap and allow rapid monitoring. The basic idea of the present invention is to use a camera installed in the locomotive cab to capture video of the driver's behavior in real time. The captured video is decomposed by a system program into consecutive picture frames, which are then input into the trained CLSTA network model for test recognition. The test content mainly covers analyzing the driver's common and abnormal behaviors during driving, such as "normal driving", "fatigue driving", "using a phone", "smoking", and "leaving the post", and generating reports. The CLSTA model is able to learn both the spatial and the temporal characteristics of consecutive pictures; the temporal characteristics are expressed through the picture sequence, and in this embodiment the model processes 16 consecutive pictures at a time. Although the environment itself is static and rigid, within the camera's field of view, once the consecutive pictures are processed temporally, the regions where the driver moves will be dynamic.
In detail, the method comprises the following steps:
Step 1: Propose an improved spatio-temporal attention network STA (Spatial-Temporal Attention) and design its topology. The STA network consists mainly of a spatial attention sub-network SA (Spatial Attention) and a temporal attention sub-network TA (Temporal Attention). Let T be the total number of time steps processed by the CLSTA network, i.e. the number of consecutive frames input to the network; in the experiments of the present invention T = 16, that is, 16 consecutive pictures are processed at a time. Fig. 3 is a block diagram of the CLSTA network topology, where Fc_layer is a fully connected layer and Relu is an activation layer.
Spatial attention sub-network SA: the sub-network SA implements spatial attention. Spatial features are extracted mainly by a convolutional neural network CNN based on the AlexNet network; the AlexNet used here comprises five convolutional layers (conv1...conv5) and one fully connected layer fc6 (the original AlexNet has three fully connected layers), six learning layers in total. The SA network is a two-stream CNN structure, CNN1 and CNN2, used to extract the spatial features of the current picture stream; CNN1 and CNN2 each have six learning layers. CNN1 processes the current frame stream x_t: the current picture stream x_t (16*227*227, where 16 is the number of consecutive pictures processed at a time and 227*227 is the picture size) is input into CNN1, whose fully connected layer outputs features of dimension 16*4096. CNN2 processes the previous frame x_{t-1}: the previous frame stream x_{t-1} (16*227*227) is input into CNN2, whose fully connected layer also outputs 16*4096 features. An eltwise layer (which performs addition, subtraction, and multiplication operations) then subtracts the output of CNN2's fully connected layer from that of CNN1's fully connected layer; the output of the eltwise layer is 16*4096 and is fed into a fully connected layer whose output is likewise 16*4096. In this way, the SA sub-network removes static background interference, i.e. it retains the spatial features that differ from the previous frame.
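The two-stream subtraction in SA can be traced at shape level. The snippet below is a toy sketch only: NumPy with fixed random weights standing in for the shared AlexNet backbone, and reduced dimensions (32*32 frames, 128-d features) instead of the 227*227 inputs and 4096-d fc6 features described above.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 128   # 16 frames per clip; 128 stands in for the 4096-d fc6 output

# Toy stand-in for the shared AlexNet backbone (conv1..conv5 + fc6):
# a fixed linear map from flattened 32x32 frames to D-dim features.
# CNN1 and CNN2 share these weights, as in the SA sub-network.
W_cnn = rng.standard_normal((32 * 32, D)) * 0.01

def cnn_features(frames):                  # frames: (T, 32, 32)
    return frames.reshape(T, -1) @ W_cnn   # -> (T, D)

x_t   = rng.standard_normal((T, 32, 32))   # current-frame stream
x_tm1 = np.roll(x_t, 1, axis=0)            # previous-frame stream (wrap-around, toy only)

# eltwise subtraction: CNN1(x_t) - CNN2(x_{t-1}) suppresses the static
# cab background and keeps the features that changed between frames.
diff = cnn_features(x_t) - cnn_features(x_tm1)   # (T, D)

# Fc_layer1 (+ ReLU), preserving the (T, D) shape like the 16*4096 above
W_fc = rng.standard_normal((D, D)) * 0.01
sa_out = np.maximum(diff @ W_fc, 0.0)

assert sa_out.shape == (T, D)
```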
Temporal attention sub-network TA: the TA network is a two-stream CNN+LSTM structure, CNN1+LSTM1 and CNN2+LSTM2, used to extract the temporal features of the current picture stream. CNN1 and CNN2 here are the same networks as in sub-network SA. Likewise, the current picture stream x_t (16*227*227) is input into CNN1 for spatial feature learning; the output of CNN1's fully connected layer has dimension 16*4096 and is input into LSTM1 for temporal learning, LSTM1's output having dimension 16*256. The previous frame stream x_{t-1} (16*227*227) is input into CNN2 for spatial feature learning; the output of CNN2's fully connected layer has dimension 16*4096 and is input into LSTM2 for temporal learning, LSTM2's output having dimension 16*256. An eltwise layer (which performs addition, subtraction, and multiplication operations) then subtracts the features output by LSTM2 from those output by LSTM1; the eltwise layer's output has dimension 16*256 and is fed into a fully connected layer whose output is likewise 16*256. In this way, the TA sub-network retains the temporal features of the moving parts.
Within the STA network, the SA sub-network contains 14 main learning layers: the six learning layers of CNN1 (five convolutional layers plus one fully connected layer), the six learning layers of CNN2 (five convolutional layers plus one fully connected layer), one eltwise layer, and one fully connected layer. The TA sub-network contains four main learning layers: LSTM1, LSTM2, one eltwise layer, and one fully connected layer. The STA network therefore contains 18 main learning layers in total.
Step 2: Fuse the STA sub-networks with the Main Convolutional-LSTM Networks to form the CLSTA network. The Main Convolutional-LSTM Networks consist of a Main CNN network cascaded with a two-layer LSTM network. This step comprises the following sub-steps:
Step 21: Input the current picture stream x_t (16*227*227) into the Main CNN; the output of the Main CNN's fully connected layer has dimension 16*4096. This step extracts the spatial features of the current picture stream. The Main CNN here, CNN1 in SA, and CNN1 in TA are all the same network, so the Main CNN's learning layers are not counted separately (they are already counted in SA).
Step 22: Fuse the output of SA with the output of the Main CNN; the fusion is an addition performed through an eltwise layer (which performs addition, subtraction, and multiplication operations), and the fused output also has dimension 16*4096. SA retains the spatial features in which the current frame differs from the previous frame, so fusing it with the spatial features output by the Main CNN highlights the spatially distinct parts. This step thus adds only one learning layer, the eltwise layer.
Step 23: Input the features output by the fusion of step 22 into the Main LSTM for temporal feature learning. The Main LSTM here is formed by cascading two LSTM layers: the input of LSTM1 is the output of step 22 and its output has dimension 16*256; the features output by LSTM1 are then input into LSTM2, whose output also has dimension 16*256. This step adds two learning layers, LSTM1 and LSTM2.
Step 24: Fuse the output of TA with the output of the Main LSTM network from step 23; the fusion is an addition performed through an eltwise layer. The fused output has dimension 16*256 and is followed by a fully connected layer, after which classification is performed; the fully connected layer's output has dimension 16*6 (16 being the 16 consecutive pictures, 6 being the number of classes: "normal driving", "fatigue driving", "using a phone", "smoking", "leaving the post", and "other"). TA retains the temporal features in which the current frame differs from the previous frame, so fusing it with the features output by the Main LSTM highlights the temporally distinct parts. This step adds two learning layers, the eltwise layer and the fully connected layer.
The CLSTA network therefore contains 23 main learning layers: the 18 learning layers of the STA network plus the 5 learning layers introduced when fusing STA with the Main Convolutional-LSTM Networks.
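The fusion flow of steps 21 through 24 can be sketched end to end at shape level. This is a toy NumPy illustration with fixed random weights in place of trained ones and 128-d CNN features instead of 4096-d; the SA and TA branch outputs are represented by placeholder arrays of the stated shapes.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, H, C = 16, 128, 256, 6   # frames, CNN feat dim (4096 in the text), LSTM dim, classes

def lstm_layer(x, H):
    """Minimal LSTM over T steps (fixed random weights), returning all hidden states."""
    D_in = x.shape[1]
    W = rng.standard_normal((4, D_in + H, H)) * 0.05   # forget, input, output, candidate
    h, c = np.zeros(H), np.zeros(H)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    out = []
    for t in range(T):
        z = np.concatenate([h, x[t]])
        f, i, o = sig(z @ W[0]), sig(z @ W[1]), sig(z @ W[2])
        c = f * c + i * np.tanh(z @ W[3])
        h = o * np.tanh(c)
        out.append(h)
    return np.stack(out)                     # (T, H)

main_cnn = rng.standard_normal((T, D))       # Main CNN fc6 features of x_t (placeholder)
sa_out   = rng.standard_normal((T, D))       # SA branch output, 16*4096 in the text
ta_out   = rng.standard_normal((T, H))       # TA branch output, 16*256 in the text

fused_spatial = main_cnn + sa_out            # step 22: eltwise addition
h2 = lstm_layer(lstm_layer(fused_spatial, H), H)   # step 23: LSTM1 -> LSTM2
fused_temporal = h2 + ta_out                 # step 24: eltwise addition

W_cls = rng.standard_normal((H, C)) * 0.05   # Fc_layer3: 16*256 -> 16*6
logits = fused_temporal @ W_cls
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

assert logits.shape == (T, C)
assert np.allclose(probs.sum(axis=1), 1.0)
```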
Step 3: Capture video with the camera and decompose it, via a script program, into consecutive RGB image frames at 5 frames per second.
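Assuming a constant-frame-rate source (e.g. a 25 fps industrial camera, a figure not stated in the text), decomposing at 5 FPS amounts to keeping every (src_fps / 5)-th frame. A minimal index-selection sketch:

```python
def sample_frame_indices(total_frames, src_fps, target_fps=5):
    """Indices of source frames to keep so the decoded stream runs at target_fps.
    Assumes a constant-frame-rate source video."""
    step = src_fps / target_fps
    return [int(i * step) for i in range(int(total_frames / step))]

# 10 seconds of 25 fps video -> 50 retained frames at 5 FPS
idx = sample_frame_indices(total_frames=250, src_fps=25, target_fps=5)
assert len(idx) == 50
assert idx[:3] == [0, 5, 10]
```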
Step 4: Input a large data set of locomotive driver behaviors into the CLSTA network as training samples. The training set contains 12,000 pictures and the test set 4,000 pictures, covering the six classes "normal driving", "fatigue driving", "using a phone", "smoking", "leaving the post", and "other". The CNN weights in the CLSTA network are initialized from the CaffeNet network weights, which greatly helps the network converge; the model is obtained through training.
Step 5: Embed the CLSTA model obtained from the training in step 4 into an industrial computer, where it recognizes and understands the locomotive driver's behavior. In use, this mainly involves the following steps:
Step 51: Capture video of the environment with an industrial camera.
Step 52: A script program on the industrial computer decomposes the video into picture frames at 5 FPS.
Step 53: Feed each run of 16 consecutive decomposed frames into the model for testing.
Step 54: Output the test results and generate a report.
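Grouping the decoded frames into runs of 16 for step 53 might look like the following sketch. Non-overlapping windows are assumed here; the text does not specify whether consecutive clips overlap.

```python
def clips_of_16(frames, clip_len=16):
    """Group decoded frames into consecutive non-overlapping clips of clip_len;
    a trailing remainder shorter than clip_len is dropped."""
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]

frames = list(range(40))       # stand-in for 40 decoded picture frames
clips = clips_of_16(frames)
assert len(clips) == 2                       # 40 frames -> two full 16-frame clips
assert clips[0][0] == 0 and clips[1][0] == 16
```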
Fig. 1 is a schematic flow diagram of a CLSTA-based railway driver behavior recognition method. The process comprises:
A. The computer drives a CCD camera through its interface to acquire images of the environment;
B. The images are decomposed into RGB pictures;
C. The RGB pictures are then fed into the CLSTA network;
D. The outputs of the CLSTA network are fused by averaging to obtain the final result, and the detection results are summarized into a detection report.
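The averaging fusion in step D can be sketched as follows, assuming the network emits one 6-class probability vector per frame of a 16-frame clip (the exact fusion granularity is not specified in the text):

```python
import numpy as np

def clip_prediction(frame_probs, classes):
    """Average the per-frame class probabilities of a clip and return
    the label of the winning class."""
    mean = np.asarray(frame_probs).mean(axis=0)
    return classes[int(mean.argmax())]

classes = ["normal driving", "fatigue driving", "using a phone",
           "smoking", "leaving the post", "other"]
probs = np.tile([0.1, 0.5, 0.1, 0.1, 0.1, 0.1], (16, 1))  # toy 16x6 network output
assert clip_prediction(probs, classes) == "fatigue driving"
```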
Fig. 2 is a schematic diagram of the LSTM network structure; the main equations are:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
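The gate equations above can be checked with a minimal NumPy LSTM step. The weights here are random toy values, and the `W`/`b` dictionaries are illustrative names introduced for the sketch, not part of the original.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above; W maps the
    concatenated [h_{t-1}, x_t] to each gate's pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(z @ W["f"] + b["f"])        # forget gate
    i_t = sigmoid(z @ W["i"] + b["i"])        # input gate
    c_tilde = np.tanh(z @ W["c"] + b["c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # cell-state update
    o_t = sigmoid(z @ W["o"] + b["o"])        # output gate
    h_t = o_t * np.tanh(c_t)                  # hidden state
    return h_t, c_t

rng = np.random.default_rng(2)
D, H = 8, 4                                   # toy input and hidden sizes
W = {k: rng.standard_normal((D + H, H)) * 0.1 for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c, W, b)
assert h.shape == (H,) and np.all(np.abs(h) < 1.0)   # |o_t| < 1 and |tanh| < 1
```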
Fig. 3 is a schematic diagram of the CLSTA network topology. On the left is the Spatial Attention sub-network, in the middle the Main CNN_LSTM Networks main network, and on the right the Temporal Attention sub-network. Data denotes the input data, 16 pictures at a time. The blocks labeled CNN1 are one and the same network, as are the blocks labeled CNN2; both are based on AlexNet. Fc_layer is a fully connected layer and Relu an activation layer (not counted among the main learning layers); 4096 is the dimension of the fully connected layer in AlexNet, i.e. the CNN feature dimension, and 256 is the LSTM output dimension.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810540015.5A (CN108846332B) | 2018-05-30 | 2018-05-30 | CLSTA-based railway driver behavior identification method |
| Publication Number | Publication Date |
|---|---|
| CN108846332A | 2018-11-20 |
| CN108846332B | 2022-04-29 |
| CN116894225B (en)* | 2023-09-08 | 2024-03-01 | 国汽(北京)智能网联汽车研究院有限公司 | Driving behavior abnormality analysis method, device, equipment and medium thereof |
| Publication number | Publication date |
|---|---|
| CN108846332B (en) | 2022-04-29 |
| Publication | Publication Date | Title |
|---|---|---|
| CN108846332B (en) | CLSTA-based railway driver behavior identification method | |
| EP3279700B1 (en) | Security inspection centralized management system | |
| CN108764034A (en) | A kind of driving behavior method for early warning of diverting attention based on driver's cabin near infrared camera | |
| CN111738044B (en) | Campus violence assessment method based on deep learning behavior recognition | |
| CN109871786A (en) | A standard process detection system for flight ground support operations | |
| Rizvi et al. | Crack detection in railway track using image processing | |
| CN108830305A (en) | A kind of real-time fire monitoring method of combination DCLRN network and optical flow method | |
| CN112766192A (en) | Intelligent train monitoring system | |
| CN114202711A (en) | An intelligent monitoring method, device and monitoring system for abnormal behavior in a train compartment | |
| CN110780356A (en) | Subway platform clearance foreign matter detecting system | |
| CN108197575A (en) | A kind of abnormal behaviour recognition methods detected based on target detection and bone point and device | |
| CN211184122U (en) | Intelligent video analysis system for linkage of railway operation safety prevention and control and large passenger flow early warning | |
| CN108182416A (en) | A kind of Human bodys' response method, system and device under monitoring unmanned scene | |
| CN113609937B (en) | Emergency processing method, system and storage medium for urban rail transit | |
| CN115205724A (en) | An alarm method, device and electronic device based on abnormal behavior | |
| CN115294519A (en) | An abnormal event detection and early warning method based on lightweight network | |
| CN116416281A (en) | Grain depot AI video supervision and analysis method and system | |
| CN117612249A (en) | Underground miner dangerous behavior identification method and device based on improved OpenPose algorithm | |
| CN112686130B (en) | Wisdom fishing boat supervision decision-making system | |
| Pratama et al. | Smart video surveillance system for level crossing: a systematic literature review | |
| CN119299332B (en) | A safety monitoring system and monitoring method based on machine vision | |
| DE102021206618A1 (en) | Method for increasing the detection accuracy of a surveillance system | |
| CN117456418A (en) | Video anomaly detection system and method based on twin network | |
| QU et al. | Multi-attention fusion drowsy driving detection model | |
| CN116595219A (en) | Ship monitoring video effective segment extraction method and management system |
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||