CN111241162A

Info

Publication number: CN111241162A
Application number: CN202010049105.1A
Authority: CN
Inventors: 徐瑞华; 朱炜; 翟学皓
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2020-06-05

Abstract

The invention relates to a method for analyzing passenger travel behavior under high-speed railway network conditions, comprising the following steps. Step 1: obtain regional economic data and the passenger ticket data of the region. Step 2: remove unreasonable data from the passenger ticket data. Step 3: fuse the regional economic data and the passenger ticket data into one data set and preprocess the data set. Step 4: cluster the data set multiple times to obtain candidate cluster subsets. Step 5: perform cluster ensemble on the candidate cluster subsets to obtain the final clustering result. Step 6: complete the analysis of passenger travel behavior according to the final clustering result. Compared with the prior art, the method is more objective, gives a more detailed behavior analysis, and is faster to carry out.

Description

Translated from Chinese

Analysis method and storage medium for passenger travel behavior under high-speed railway network conditions

Technical field

The invention relates to the technical field of rail transit, and in particular to a method for analyzing passenger travel behavior under high-speed railway network conditions.

Background

Traditional travel behavior analysis is usually built on a passenger choice model: a discrete choice model of passenger travel is established first, its parameters are then estimated from a stated preference (SP) questionnaire survey, and the share allocated to each travel mode is derived. This approach places high demands on the SP questionnaire: respondents must be able to express their preferences clearly, and the questions and answers must be designed to be mutually independent and free of obvious bias. The approach is therefore somewhat subjective, is limited by the reliability of the survey data, and cannot faithfully reflect actual travel behavior.

Passenger ticket data refers to information on tickets purchased through the 12306 website or app and at the ticket windows of high-speed railway stations, stored as order data in the railway ticketing system. The order data includes fields such as the order event id, the train number ordered, the origin and destination, the seat type and the number of tickets ordered, and does not directly provide the personal attributes of passengers. Ticket data therefore cannot be used in traditional travel behavior analysis methods.

In summary, current travel behavior analysis has the following shortcomings:

1. Traditional travel behavior analysis methods are somewhat subjective, are limited by the reliability of survey data, and cannot faithfully reflect actual travel behavior.

2. Ticket data is objective data generated when passengers travel, but it cannot be used in traditional travel behavior analysis.

Summary of the invention

The purpose of the present invention is to overcome the above defects of the prior art by providing a more objective and faster method for analyzing passenger travel behavior under high-speed railway network conditions.

The object of the present invention can be achieved through the following technical solutions:

A method for analyzing passenger travel behavior under high-speed railway network conditions, the method being a program embedded in a computer and including the following steps:

Step 1: Obtain regional economic data and the passenger ticket data of the region;

Step 2: Remove unreasonable data from the passenger ticket data;

Step 3: Fuse the regional economic data and the passenger ticket data into one data set, and preprocess the data set;

Step 4: Cluster the data set multiple times to obtain candidate cluster subsets;

Step 5: Perform cluster ensemble on the candidate cluster subsets to obtain the final clustering result;

Step 6: Complete the analysis of passenger travel behavior according to the clustering result obtained in Step 5.

Preferably, the unreasonable data in Step 2 includes refunded-ticket records and records of orders that were placed without a ticket being purchased.
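The filtering in Step 2 can be illustrated with a minimal Python sketch. The record layout, the `status` field and its values (`"paid"`, `"refunded"`, `"unpaid"`) are hypothetical; the source does not describe how the ticketing system marks refunds or unpaid orders.

```python
# Hypothetical order records; field names and status values are assumptions.
orders = [
    {"order_id": 1, "status": "paid", "tickets": 2},
    {"order_id": 2, "status": "refunded", "tickets": 1},
    {"order_id": 3, "status": "unpaid", "tickets": 1},
    {"order_id": 4, "status": "paid", "tickets": 3},
]

# Statuses treated as "unreasonable data" in this sketch.
UNREASONABLE_STATUSES = {"refunded", "unpaid"}


def drop_unreasonable(records):
    """Keep only orders that correspond to a completed, non-refunded purchase."""
    return [r for r in records if r["status"] not in UNREASONABLE_STATUSES]


valid_orders = drop_unreasonable(orders)
```

Only orders 1 and 4 remain in `valid_orders` for this toy input.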

Preferably, the data set preprocessing method in Step 3 is specifically:

The raw data is first normalized by the max-min normalization method, and the data set is then standardized by the Z-score method.

Preferably, Step 4 specifically comprises:

Step 4-1: Determine the optimal k value used by the k-means clustering method;

Step 4-2: Cluster the data set multiple times using the k value determined in Step 4-1 and random initial cluster centers to obtain the cluster subsets.

More preferably, the optimal k value in Step 4-1 is obtained by the silhouette coefficient method: candidate k values are enumerated, the silhouette coefficient of each sample point and the average silhouette coefficient of all sample points are computed for each k, and the k value with the largest average silhouette coefficient is the optimal k value;

The silhouette coefficient is calculated as:

$$s(X_j) = \frac{b_{k,i} - a_{k,i}}{\max(a_{k,i},\, b_{k,i})}$$

where $s(X_j)$ is the silhouette coefficient of the individual node $X_j$, $a_{k,i}$ is the average distance from sample j to the nodes in cluster $C_k$, and $b_{k,i}$ is the average distance from sample j to the centers of the clusters other than $C_k$;

The average silhouette coefficient is calculated as:

$$\bar{s} = \frac{1}{N} \sum_{j=1}^{N} s(X_j)$$

where N is the total number of samples.

Preferably, Step 5 is specifically:

The cluster ensemble result is first obtained by the voting method and is then checked using the average normalized mutual information.

More preferably, the average normalized mutual information is calculated as follows:

Let [symbol not reproduced] be the set of nodes belonging to cluster C in candidate cluster subset Y_i. The mutual information between candidate cluster subsets Y_i and Y_k is:

[formula not reproduced]

where N is the total number of samples. The information entropy of candidate cluster subset Y_i is:

[formula not reproduced]

The average normalized mutual information (ANMI) is calculated as:

[formula not reproduced]

where [symbol not reproduced] is the set of clustering results in this clustering.

More preferably, the ANMI takes values in [0,1]; the larger its value, the better the cluster ensemble.

A storage medium storing the computer program of the analysis method.

Compared with the prior art, the present invention has the following advantages:

1. Fast implementation: the passenger travel behavior analysis method of the present invention is carried out by a computer and is therefore faster than traditional manual analysis.

2. More objective: the method uses passengers' ticket information and associates it with regional economic information, so its analysis of passenger travel behavior is more objective than the traditional questionnaire survey approach.

3. More detailed behavior analysis: the method analyzes the data with a voting-based cluster ensemble and finally summarizes passenger travel behavior into five types: work, leisure, business, high-end and economy. This finer division of travel behavior provides a data basis for the operation and decision-making of the railway department.

Description of drawings

Fig. 1 is a schematic flow chart of the present invention;

Fig. 2 shows the clustering result analyzed by travel time factor and regional factor in an embodiment of the present invention;

Fig. 3 shows the clustering result analyzed by travel time factor and personal consumption factor in an embodiment of the present invention;

Fig. 4 shows the clustering result analyzed by personal consumption factor and regional factor in an embodiment of the present invention;

Fig. 5 shows the distribution of ticket purchase lead time for the five clusters in an embodiment of the present invention;

Fig. 6 shows the distribution of travel dates for the five clusters in an embodiment of the present invention;

Fig. 7 shows the distribution of departure times for the five clusters in an embodiment of the present invention;

Fig. 8 shows the distribution of arrival times for the five clusters in an embodiment of the present invention;

Fig. 9 shows the distribution of unit fare for the five clusters in an embodiment of the present invention;

Fig. 10 shows the distribution of ticket sales mode for the five clusters in an embodiment of the present invention;

Fig. 11 shows the distribution of per capita GDP of the origin and destination cities for the five clusters in an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative work fall within the protection scope of the present invention.

The present invention relates to a method for analyzing passenger travel behavior under high-speed railway network conditions. The method is a program embedded in a computer; its flow is shown in Fig. 1 and includes the following steps:

Step 1: Obtain regional economic data and the passenger ticket data of the region;

Step 2: Remove unreasonable data from the passenger ticket data; unreasonable data includes refunded-ticket records and records of orders that were placed without a ticket being purchased;

Step 3: Fuse the regional economic data and the passenger ticket data into one data set, and preprocess the data set;

The fusion method is: according to the origin and destination of each order in the ticket data, associate the origin and destination with the regional economic data;
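The origin and destination association can be sketched as a keyed lookup. This is a minimal illustration: the field names (`origin`, `destination`, `gdp_per_capita`), the city names used as keys and the numeric values are all made up for the example and do not come from the source.

```python
# Hypothetical regional economic table keyed by city; values are placeholders.
city_economy = {
    "CityA": {"gdp_per_capita": 100.0, "population": 10.0},
    "CityB": {"gdp_per_capita": 120.0, "population": 8.0},
}


def fuse(order, economy):
    """Attach the economic attributes of the origin and destination to one order."""
    fused = dict(order)  # copy so the original order record is not modified
    for role in ("origin", "destination"):
        for name, value in economy[order[role]].items():
            fused[f"{role}_{name}"] = value
    return fused


order = {"order_id": 1, "origin": "CityA", "destination": "CityB", "tickets": 2}
fused_order = fuse(order, city_economy)
```

Each fused record then carries both the ticket fields and the economic attributes of its two endpoints, which is the form the later clustering steps operate on.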

The data is preprocessed as follows: the raw data is first normalized by max-min normalization, and the data set is then standardized by the Z-score method;

Normalization: the data is rescaled so that it is confined to a fixed range, usually [0,1]. Normalization makes later processing more convenient and ensures that different fields are treated on the same scale. This embodiment applies max-min normalization, a linear transformation of the raw data:

$$X'_{i,j} = \frac{X_{i,j} - \min_i X_{i,j}}{\max_i X_{i,j} - \min_i X_{i,j}}$$

where $X_{i,j}$ is the j-th indicator in the i-th row of the data table.

Standardization: the data is rescaled relative to its mean and spread. The method can be applied even when the data does not follow a normal distribution, and the standardized values can be positive or negative. This embodiment applies Z-score standardization, which rescales each indicator to mean 0 and variance 1 without changing the shape of its distribution:

$$Z_{i,j} = \frac{X_{i,j} - \mu_j}{\sigma_j}$$

where $\mu_j$ is the mean of the j-th indicator and $\sigma_j$ is the standard deviation of the j-th indicator;
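The two preprocessing steps can be sketched with NumPy as follows. This is an illustration under stated assumptions: the population standard deviation (`ddof=0`) is used, and constant columns are left at zero to avoid division by zero; the source specifies neither choice.

```python
import numpy as np


def min_max(X):
    """Rescale every column (indicator) linearly to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span


def z_score(X):
    """Shift and scale every column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1.0)  # guard against constant columns
    return (X - mu) / sigma


# Toy data: three records, two indicators on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]])
Z = z_score(min_max(X))
```

One design observation: because the Z-score is unchanged by a positive linear rescaling of a column, `z_score(min_max(X))` equals `z_score(X)` for non-constant columns, so the max-min step does not alter the final standardized values.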

Step 4 is specifically:

Step 4-1: Determine the optimal k value used by the k-means clustering method. This embodiment obtains the optimal k value by the silhouette coefficient method: candidate k values are enumerated, the silhouette coefficient of each sample point and the average silhouette coefficient of all sample points are computed for each k, and the k value with the largest average silhouette coefficient is the optimal k value;

The silhouette coefficient is calculated as:

$$s(X_j) = \frac{b_{k,i} - a_{k,i}}{\max(a_{k,i},\, b_{k,i})}$$

where $s(X_j)$ is the silhouette coefficient of the individual node $X_j$, $a_{k,i}$ is the average distance from sample j to the nodes in cluster $C_k$, and $b_{k,i}$ is the average distance from sample j to the centers of the clusters other than $C_k$;

The average silhouette coefficient is calculated as:

$$\bar{s} = \frac{1}{N} \sum_{j=1}^{N} s(X_j)$$

where N is the total number of samples;
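A minimal NumPy sketch of the silhouette computation follows. It uses the conventional definition, in which b is the smallest mean distance from a sample to the points of any other cluster; this may differ in detail from the definition of b used in this document. The function names are illustrative, and at least two clusters are assumed.

```python
import numpy as np


def silhouette_values(X, labels):
    """Per-sample silhouette coefficients (conventional definition)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix, O(n^2) memory.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for j in range(len(X)):
        same = labels == labels[j]
        n_same = same.sum()
        if n_same == 1:
            continue  # common convention: a singleton cluster scores 0
        a = dist[j, same].sum() / (n_same - 1)  # excludes the sample itself
        b = min(dist[j, labels == c].mean() for c in clusters if c != labels[j])
        denom = max(a, b)
        s[j] = (b - a) / denom if denom > 0 else 0.0
    return s


def mean_silhouette(X, labels):
    """Average silhouette coefficient over all samples."""
    return float(silhouette_values(X, labels).mean())


points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = mean_silhouette(points, [0, 0, 1, 1])  # compact, well separated
bad = mean_silhouette(points, [0, 1, 0, 1])   # clusters cut across the gap
```

To choose k, one would compute `mean_silhouette` for the k-means labels obtained at each candidate k and keep the k with the largest value. Since the pairwise distance matrix grows quadratically, a data set of several hundred thousand records would probably need to be sampled first.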

Step 4-2: Cluster the data set 10 times using the fixed k value and random initial cluster centers to obtain the cluster subsets. k-means with different initializations is typically generated by running the algorithm several times with a fixed k value and different initial cluster centers. Each run proceeds as follows:

1. Select an indicator subset [symbol not reproduced] and take the data set U as input;

2. Randomly select k initial cluster centers;

3. Compute the distance between each sample point and each cluster center, and assign each sample point to the nearest cluster. The distance between sample point $x_j$ and cluster center $c_i$ is:

$$d(x_j, c_i) = \sqrt{\sum_{t=1}^{m} (x_{j,t} - c_{i,t})^2}$$

where i = 1,2,...,k and m is the dimension of the data set and of the cluster centers;

4. Update the position of each cluster center from the sample points currently in the cluster:

$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

where $|C_i|$ is the number of samples in cluster $C_i$;

5. Repeat steps 3 and 4 until the cluster centers no longer change or the objective function reaches its minimum, which completes one clustering run. The objective function is:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)^2$$
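The run described in the steps above can be sketched in NumPy as follows. This is a minimal illustration assuming Euclidean distance and initial centers drawn from the data points; the function names, the `max_iter` cap and the handling of empty clusters are choices made for the sketch, not details taken from the source.

```python
import numpy as np


def kmeans(X, k, rng, max_iter=100):
    """One k-means run from random initial centers; returns labels, centers, SSE."""
    X = np.asarray(X, dtype=float)
    # Step 2: pick k distinct samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every sample to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 4: move each center to the mean of its members.
        new_centers = centers.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[i] = members.mean(axis=0)
        # Step 5: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers, then the objective value.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    sse = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, sse


def candidate_partitions(X, k, runs=10, seed=0):
    """Repeat k-means with different random initial centers, keeping each labeling."""
    rng = np.random.default_rng(seed)
    return [kmeans(X, k, rng)[0] for _ in range(runs)]
```

Calling `candidate_partitions(X, k, runs=10)` yields the ten label vectors that the later ensemble step combines. Cluster label numbers are arbitrary per run, so the partitions have to be aligned before they can be compared.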

Step 5: Perform cluster ensemble on the candidate cluster subsets to obtain the final clustering result. This embodiment obtains the cluster ensemble result by the voting method and then checks it using the average normalized mutual information;
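The source does not spell out how the voting is carried out. One common form, sketched here purely as an assumption, first relabels every partition so that its cluster numbers agree as far as possible with a reference partition, and then takes a per-sample majority vote. The exhaustive search over label permutations is practical only for small k (120 permutations for k = 5).

```python
from collections import Counter
from itertools import permutations


def align(reference, labels, k):
    """Relabel `labels` with the permutation that agrees most with `reference`."""
    best_perm, best_hits = None, -1
    for perm in permutations(range(k)):
        hits = sum(perm[l] == r for l, r in zip(labels, reference))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return [best_perm[l] for l in labels]


def vote(partitions, k):
    """Majority vote per sample after aligning all partitions to the first one."""
    reference = list(partitions[0])
    aligned = [reference] + [align(reference, p, k) for p in partitions[1:]]
    return [Counter(column).most_common(1)[0][0] for column in zip(*aligned)]


# Four toy partitions of five samples; the first three are the same grouping
# under different label numbers, the fourth disagrees on one sample.
partitions = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 2],
    [2, 2, 0, 0, 1],
    [0, 1, 1, 1, 2],
]
consensus = vote(partitions, k=3)
```

Here the consensus is `[0, 0, 1, 1, 2]`: the single dissenting label in the fourth partition is outvoted.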

The average normalized mutual information (ANMI) is calculated as:

[formula not reproduced]

where [symbol not reproduced] is the set of clustering results in this clustering.

The ANMI takes values in [0,1]; the larger its value, the better the cluster ensemble.
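As an illustration of how such a check could be computed, the sketch below uses the standard definitions of entropy and mutual information for two labelings of the same N samples, and normalizes by the geometric mean of the two entropies. That normalization is an assumption; the exact form used in this document is not recoverable from the text. The ANMI is then the mean normalized mutual information between the consensus labeling and each candidate partition.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy of a labeling, from cluster size proportions."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))  # sizes of the pairwise cluster intersections
    return sum(
        (n_ab / n) * log(n * n_ab / (count_a[x] * count_b[y]))
        for (x, y), n_ab in joint.items()
    )


def nmi(a, b):
    """Mutual information normalized by sqrt(H(a) * H(b)) (assumed form)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        # Convention for degenerate single-cluster labelings.
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / sqrt(ha * hb)


def anmi(consensus, partitions):
    """Average NMI between the consensus labeling and each candidate partition."""
    return sum(nmi(consensus, p) for p in partitions) / len(partitions)


consensus = [0, 0, 1, 1]
same_grouping = [1, 1, 0, 0]   # identical grouping, different label numbers
unrelated = [0, 1, 0, 1]       # carries no information about the consensus
score = anmi(consensus, [same_grouping, unrelated])
```

Because mutual information depends only on how samples are grouped, not on the label numbers, no label alignment is needed for this check. For the toy input the two NMI values are 1 and 0, so `score` is 0.5.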

Step 6: Complete the analysis of passenger travel behavior according to the finally selected clustering result.

The present invention also relates to a storage medium storing a program corresponding to the above method.

A specific embodiment of the passenger travel behavior analysis method of the present invention follows.

Spatial scope: all high-speed trains, both G-prefix and D-prefix, whose origin or destination is a station of the Shanghai Railway Bureau.

Time range: March 1, 2017 to March 20, 2017.

Data scale: about 620,000 valid records after preprocessing, that is, after removing records of refunded tickets and of orders placed without a ticket being purchased.

Basic fields: order date, departure date, train number, originating station, terminal station, ticket origin station, ticket destination station, departure time, seat class, sales mode, number of tickets, ticket revenue.

Supplementary fields: economic, demographic and social attributes of the origin and destination.

Fields that can be derived to describe passenger travel characteristics: booking lead time, day of departure (weekday, month), departure period, trip duration, arrival period, seat class, tickets per order, ticket revenue.

After the ticket data is fused with the regional economic data, the data is preprocessed and then clustered multiple times with the k-means clustering method. The resulting clustering effect is shown in Table 1.

Table 1 Description of clustering effect

Because the indicators are correlated with one another, a smaller number of composite indicators can capture most of the information in the original indicators. This embodiment uses principal component analysis to extract common factors from the original indicators and keeps the factors whose eigenvalue is greater than 1. The analysis yields 6 new composite indicators that represent the original 11 variables and explain 85.06% of the information.
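The eigenvalue-greater-than-1 selection can be sketched with NumPy as follows. This is an illustration that assumes the analysis is run on the correlation matrix of the indicators and that no indicator is constant; the function name and the synthetic data are made up for the example and do not reproduce the embodiment's 11 variables.

```python
import numpy as np


def pca_eigenvalue_rule(X):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each indicator
    corr = np.corrcoef(Z, rowvar=False)            # indicator correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)        # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    explained = float(eigvals[keep].sum() / eigvals.sum())
    scores = Z @ eigvecs[:, keep]                  # data in the reduced space
    return scores, eigvals, explained


# Synthetic example: four indicators forming two strongly correlated pairs.
rng = np.random.default_rng(0)
base1, base2 = rng.normal(size=500), rng.normal(size=500)
data = np.column_stack([
    base1,
    base1 + 0.1 * rng.normal(size=500),
    base2,
    base2 + 0.1 * rng.normal(size=500),
])
scores, eigvals, explained = pca_eigenvalue_rule(data)
```

In this synthetic case two components are retained, one per correlated pair, and together they account for nearly all of the variance. The eigenvalues of a correlation matrix sum to the number of indicators, which is why an eigenvalue above 1 means a component carries more information than a single original indicator.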

Cluster ensemble is then performed; the results are shown in Table 2.

Table 2 Cluster ensemble results

As the table shows, the cluster ensemble result obtained by the voting method in this embodiment has an average normalized mutual information of 0.67, the best among all the cluster ensemble results.

The clustering effect is analyzed in terms of the travel time factor, the regional factor and the personal consumption factor. The results are shown in Figs. 2-4: the clusters in this embodiment are separated by fairly clear boundaries, and the clustering is reasonable.

The distribution characteristics of the passenger flow categories corresponding to the five final clusters are shown in Table 5.

Table 5

Figs. 5-11 show, for the five clusters in the table above, the distributions of booking lead time, departure date, departure period, arrival period, unit fare, ticket sales mode and economic level. The specific characteristics of the five clusters summarized from these figures are shown in Table 6.

Table 6

After clustering, passenger travel behavior is finally summarized into five types: work, leisure, business, high-end and economy.

The above are only specific embodiments of the present invention, and the protection scope of the present invention is not limited to them. Any person skilled in the art can readily conceive of equivalent modifications or substitutions within the technical scope disclosed by the present invention, and such modifications or substitutions fall within the protection scope of the present invention. The protection scope of the present invention is therefore defined by the claims.

Claims

1. A method for analyzing passenger travel behavior under high-speed railway network conditions, characterized in that the method is a program embedded in a computer and comprises the following steps:

step 1: obtaining regional economic data and the passenger ticket data of the region;

step 2: removing unreasonable data from the passenger ticket data;

step 3: fusing the regional economic data and the passenger ticket data into one data set, and preprocessing the data set;

step 4: clustering the data set multiple times to obtain candidate cluster subsets;

step 5: performing cluster ensemble on the candidate cluster subsets to obtain a final clustering result;

step 6: completing the analysis of passenger travel behavior according to the clustering result obtained in step 5.

2. The method for analyzing passenger travel behavior under high-speed railway network conditions according to claim 1, wherein the unreasonable data in step 2 comprises refunded-ticket records and records of orders placed without a ticket being purchased.

3. The method for analyzing passenger travel behavior under high-speed railway network conditions according to claim 1, wherein the data set preprocessing method in step 3 is specifically:

the raw data is first normalized by the max-min normalization method, and the data set is then standardized by the Z-score method.

4. The method for analyzing passenger travel behavior under high-speed railway network conditions according to claim 1, wherein step 4 specifically comprises:

step 4-1: determining the optimal k value used by the k-means clustering method;

step 4-2: clustering the data set multiple times using the k value determined in step 4-1 and random initial cluster centers to obtain the cluster subsets.

5. The method for analyzing passenger travel behavior under high-speed railway network conditions according to claim 4, wherein the optimal k value in step 4-1 is obtained by the silhouette coefficient method, specifically: candidate k values are enumerated, the silhouette coefficient of each sample point and the average silhouette coefficient of all sample points are calculated for each k value, and the k value corresponding to the largest average silhouette coefficient is the optimal k value;

the silhouette coefficient is calculated as:

$$s(X_j) = \frac{b_{k,i} - a_{k,i}}{\max(a_{k,i},\, b_{k,i})}$$

wherein $s(X_j)$ is the silhouette coefficient of the individual node $X_j$, $a_{k,i}$ is the average distance from sample j to the nodes in cluster $C_k$, and $b_{k,i}$ is the average distance from sample j to the centers of the clusters other than $C_k$;

the average silhouette coefficient is calculated as:

$$\bar{s} = \frac{1}{N} \sum_{j=1}^{N} s(X_j)$$

wherein N is the total number of samples.

6. The method for analyzing passenger travel behavior under high-speed railway network conditions according to claim 1, wherein step 5 is specifically:

a cluster ensemble result is first obtained by the voting method, and the cluster ensemble result is then checked using the average normalized mutual information.

7. The method for analyzing passenger travel behavior under high-speed railway network conditions according to claim 6, wherein the average normalized mutual information is calculated as follows:

let [symbol not reproduced] be the set of nodes belonging to cluster C in candidate cluster subset Y_i; the mutual information between candidate cluster subsets Y_i and Y_k is:

[formula not reproduced]

wherein N is the total number of samples, and the information entropy of candidate cluster subset Y_i is:

[formula not reproduced]

the average normalized mutual information (ANMI) is calculated as:

[formula not reproduced]

wherein [symbol not reproduced] is the set of clustering results in this clustering.

8. The method for analyzing passenger travel behavior under high-speed railway network conditions according to claim 7, wherein the ANMI takes values in [0,1], and the larger its value, the better the cluster ensemble.

9. A storage medium storing a computer program of the analysis method according to claim 1.