Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a method for identifying the alighting stop of a bus IC card passenger when the travel chain is broken. The method identifies the alighting stop of conventional bus trips with a broken travel chain with high accuracy, has a wide range of application, and can meet practical engineering requirements.
The technical scheme of the invention is as follows:
A method for identifying the alighting stop of a bus IC card passenger when the travel chain is broken comprises the following steps:
1) Based on conventional bus IC card swiping data and operating vehicle data, the first layer of a Stacking framework is used to identify the alighting stop of the conventional bus IC card passenger;
2) Taking the identification result of step 1) as input, the second layer of the Stacking framework, based on a Logistic regression model, is used to identify the alighting stop of the bus IC card passenger.
Preferably, in step 1), let the $b$-th trip with a broken travel chain made by the $m$-th passenger on day $d$ board at the $j_1$-th stop of the $T$-th run in direction $f$ of route $l$; identification yields the probability that this trip alights at the $j_2$-th possible alighting stop, where $j_1 < j_2 < J$;
The first layer of the Stacking framework identifies the alighting stop of the conventional bus IC card passenger using one or more of the following: a method based on personal high-frequency stops, a method based on downstream stop attraction, a method based on transfer convenience probability, a method based on land-use attraction probability, and a method based on group history records. Each method yields its own probability of alighting at the $j_2$-th possible alighting stop.
Preferably, in step 1), the method based on personal high-frequency stops determines the alighting probability of a possible alighting stop as follows:
Count the total number of times the $m$-th passenger swiped the card to board at the $j_2$-th possible alighting stop during the $D$ days of the study period; the probability that this trip alights at the $j_2$-th possible alighting stop is then computed from this count.
Preferably, in step 1), the method based on downstream stop attraction determines the alighting probability of a possible alighting stop as follows:
Count the total number of boarding card swipes at the $j_2$-th possible alighting stop for the bus run that the $m$-th passenger boarded at the $j_1$-th stop; the probability that this trip alights at the $j_2$-th possible alighting stop is then computed from this count.
Preferably, in step 1), the method based on transfer convenience probability determines the alighting probability of a possible alighting stop as follows:
Count, from the static bus route and stop information, the number of bus routes serving the $j_2$-th possible alighting stop; the probability that this trip alights at the $j_2$-th possible alighting stop is then computed from this number of routes.
In addition, because every bus stop is served by at least one bus route, this number of routes is at least 1.
Preferably, in step 1), the method based on land-use attraction probability determines the alighting probability of a possible alighting stop as follows:
Let the study area surrounding the $j_2$-th possible alighting stop contain $H$ types of urban construction land in total; the probability that this trip alights at the $j_2$-th possible alighting stop is then computed from the land use around the stop,
where $C_h$ is the attraction coefficient of the $h$-th type of urban construction land, $h \in \{1, 2, \ldots, H\}$, and is applied to the land of type $h$ around the possible alighting stop.
Preferably, in step 1), the method based on group history records determines the alighting probability of a possible alighting stop as follows:
a) Cluster the bus IC card data with identified boarding stops, and take the records in the same cluster whose alighting stops were identified by the travel-chain method as the group history records used to determine the alighting stops still to be identified;
b) Construct clustering indexes of two types: the first type consists of relevant fields in the bus IC card data with identified boarding stops and records the data generated by each card swipe; the second type consists of several indexes constructed from the first-type indexes and actual conditions and is used to mine the similarity between different IC card records;
c) Select several clustering indexes and normalize them: scale each index by min-max normalization so that its values lie between a given minimum and a given maximum, bringing the feature values of every clustering index to unit scale;
d) Cluster with the K-Means algorithm, determining the optimal number of clusters $C_G$ with the elbow rule, to obtain the card-swiping travel patterns;
e) Suppose the data of the $b$-th trip with a broken travel chain made by the $m$-th passenger on day $d$ belong to a given cluster; the records in that cluster whose alighting stops were determined by the travel-chain method then form the group history data set. From the group history data set, determine the frequency of trips that board at the $j_1$-th stop and alight at the $j_2$-th possible alighting stop; the probability that this trip alights at the $j_2$-th possible alighting stop is then computed from this frequency.
Preferably, step 2) is specifically as follows:
2.1) Build the model: label each possible alighting stop identified in step 1) as 0 or 1, where a stop labelled 1 is the correct alighting stop and a stop labelled 0 is an incorrect alighting stop; the correct and incorrect alighting stops are used as input to the Logistic regression model of the second layer of the Stacking framework;
For the $b$-th trip with a broken travel chain made by the $m$-th passenger on day $d$, the Logistic regression model outputs the probability of alighting at the $j_2$-th possible alighting stop, namely the logistic (sigmoid) function of the weighted sum of the inputs plus the bias,
where the input vector consists of $P_{m,d,b}(j_1, j_2)$, which is one or more of the five first-layer alighting probabilities; the weight vector consists of one or more of $w_1$, $w_2$, $w_3$, $w_4$, $w_5$, the weights of the respective first-layer probabilities; and $w_0$ is the bias;
2.2) Use the bus IC card data whose alighting stops were identified by the travel-chain method as the training set and test set to learn the model;
2.3) Estimate the model parameters by maximum likelihood estimation, using the L-BFGS algorithm, which is suited to large-scale data, to determine the parameter values; the probability of alighting at the $j_2$-th possible alighting stop is then obtained by evaluating the model with the estimated parameters,
where the parameters used are the maximum likelihood estimates of the weights;
2.4) The alighting stop of the $b$-th trip with a broken travel chain made by the $m$-th passenger on day $d$ is the $j_2$-th possible alighting stop with the highest alighting probability among the possible alighting stops.
Preferably, in step 2.2), if the number of incorrect alighting stops is greater than the number of correct alighting stops, the following steps are performed:
2.2.1) Randomly undersample the data of incorrect alighting stops and merge them with the original data of correct alighting stops;
2.2.2) Oversample the data of correct alighting stops using the SMOTE algorithm and merge them with the original data of incorrect alighting stops;
2.2.3) Merge the data obtained from steps 2.2.1) and 2.2.2), select 90% of the data as the training set of the Logistic regression model, and use the remaining 10% as its test set.
Preferably, in step 2.4), after the number of the alighting stop is determined, its name, longitude, and latitude are determined from the static bus route and stop information.
The beneficial effects of the invention are as follows:
The method for identifying the alighting stop of a bus IC card passenger when the travel chain is broken uses a two-layer Stacking framework that combines aggregate and disaggregate models. The first layer of the Stacking framework uses the method based on personal high-frequency stops, the method based on downstream stop attraction, the method based on transfer convenience probability, the method based on land-use attraction probability, and the method based on group history records. The second layer uses a Logistic regression model to determine, for each data set, the weight of each first-layer method, so that the resulting model parameters fit the data set better, generalize better, and improve identification accuracy. The weekday morning peak is the period in which passengers travel most regularly, and the more often a passenger swipes the card, the more regular the travel behavior.
In the first layer of the Stacking framework, the invention provides a method based on group history records to improve the probability of identifying the alighting stop, overcoming the shortage of personal historical travel records in methods based on personal history. A method based on personal history relies on the passenger's recorded historical trips on the same route and at the same stop, and identifies the alighting stop from similar travel patterns in that personal history.
Compared with the traditional KNN-based, decision-tree-based, and random-forest-based methods, the two-layer Stacking framework of the invention can identify the alighting stop for all IC card records with a broken travel chain. The Logistic regression model in the second layer assigns different weights to the first-layer methods, and these weights can be adjusted for different data sets, which gives the method better generalization and, in turn, higher accuracy.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
As shown in fig. 1 and fig. 2, the method for identifying the alighting stop of a bus IC card passenger when the travel chain is broken comprises the following steps:
1) Based on conventional bus IC card swiping data and operating vehicle data, the first layer of a Stacking framework is used to identify the alighting stop of the conventional bus IC card passenger. Here, let a trip with a broken travel chain made by the $m$-th passenger on day $d$ board at the $j_1$-th stop of the $T$-th run in direction $f$ of route $l$; identification yields the probability that this trip alights at the $j_2$-th possible alighting stop, where $j_1 < j_2 < J$;
The first layer of the Stacking framework identifies the alighting stop of the conventional bus IC card passenger using one or more of the following: a method based on personal high-frequency stops, a method based on downstream stop attraction, a method based on transfer convenience probability, a method based on land-use attraction probability, and a method based on group history records. Each method yields its own probability of alighting at the $j_2$-th possible alighting stop.
2) Taking the identification result of step 1) as input, the second layer of the Stacking framework, based on a Logistic regression model, is used to identify the alighting stop of the bus IC card passenger.
In step 1), the method based on personal high-frequency stops determines the alighting probability of a possible alighting stop as follows:
Count the total number of times the $m$-th passenger swiped the card to board at the $j_2$-th possible alighting stop during the $D$ days of the study period; a swipe is counted regardless of whether the alighting stop of that trip was identified by the travel-chain method. The probability that this trip alights at the $j_2$-th possible alighting stop is then computed from this count.
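The counting step above can be sketched in Python. This is a minimal sketch, assuming the alighting probability is the passenger's boarding count at a stop divided by the total boarding count over all candidate stops; the function name and the input format are hypothetical and are not taken from the patent.

```python
from collections import Counter

def personal_high_frequency_probs(boarding_history, candidate_stops):
    """Return {stop: probability} for the candidate alighting stops.

    boarding_history: stop indices at which this passenger boarded during
    the study period (hypothetical input format).
    Assumption: probability = boarding count at the stop divided by the
    total boarding count over all candidate stops.
    """
    counts = Counter(boarding_history)
    total = sum(counts[s] for s in candidate_stops)
    if total == 0:
        # No boarding history at any candidate stop: no evidence from this method.
        return {s: 0.0 for s in candidate_stops}
    return {s: counts[s] / total for s in candidate_stops}

# Hypothetical passenger who boarded at stops 3, 5, 5, 7, 5, 9, 7.
history = [3, 5, 5, 7, 5, 9, 7]
probs = personal_high_frequency_probs(history, candidate_stops=[4, 5, 6, 7])
print(probs)
```

Stops the passenger never boarded at receive probability zero, which is one reason the second layer combines this method with the others.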
The method based on downstream stop attraction determines the alighting probability of a possible alighting stop as follows:
Count the total number of boarding card swipes at the $j_2$-th possible alighting stop for the bus run that the $m$-th passenger boarded at the $j_1$-th stop; a swipe is counted regardless of whether the alighting stop of that trip was identified by the travel-chain method. The probability that this trip alights at the $j_2$-th possible alighting stop is then computed from this count.
The method based on transfer convenience probability determines the alighting probability of a possible alighting stop as follows:
Count, from the static bus route and stop information, the number of bus routes serving the $j_2$-th possible alighting stop; the probability that this trip alights at the $j_2$-th possible alighting stop is then computed from this number of routes.
In addition, because every bus stop is served by at least one bus route, this number of routes is at least 1.
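The route-counting step can be illustrated as follows. This is a sketch under the assumption that the alighting probability is proportional to the number of routes serving a stop, normalized over the candidate stops; `route_stops` and the route identifiers are hypothetical.

```python
def transfer_convenience_probs(route_stops, candidate_stops):
    """route_stops: {route_id: set of stop ids served} (hypothetical format)."""
    n_routes = {
        stop: sum(1 for served in route_stops.values() if stop in served)
        for stop in candidate_stops
    }
    total = sum(n_routes.values())
    if total == 0:
        # Should not happen: every stop is served by at least one route.
        raise ValueError("no route serves any candidate stop")
    # Assumption: probability is proportional to the number of routes at a stop.
    return {stop: n / total for stop, n in n_routes.items()}

# Hypothetical static network with three routes.
routes = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {4, 6}}
probs = transfer_convenience_probs(routes, candidate_stops=[2, 3, 4])
print(probs)
```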
The method based on land-use attraction probability determines the alighting probability of a possible alighting stop as follows:
Let the study area surrounding the $j_2$-th possible alighting stop contain $H$ types of urban construction land in total; the probability that this trip alights at the $j_2$-th possible alighting stop is then computed from the land use around the stop,
where $C_h$ is the attraction coefficient of the $h$-th type of urban construction land, $h \in \{1, 2, \ldots, H\}$, and is applied to the land of type $h$ around the possible alighting stop.
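A possible reading of the land-use weighting is sketched below. The coefficients are those listed in Table 1 of the example; the assumption that each land type contributes its area times its coefficient, the normalization over candidate stops, the dictionary keys, and the sample areas are all illustrative and not confirmed by the text.

```python
# Attraction coefficients from Table 1 (keys are hypothetical short names).
ATTRACTION = {
    "residential": 1.0,
    "commercial_service": 1.2,
    "public_management_service": 1.1,
    "industrial": 1.0,
    "logistics_storage": 0.6,
    "public": 0.8,
    "green_plaza": 0.7,
    "traffic_facility": 1.3,
}

def land_use_probs(areas_by_stop):
    """areas_by_stop: {stop: {land_type: area around the stop}}.

    Assumption: a stop's attraction score is the sum over land types of
    coefficient * area, and probabilities are scores normalized over stops.
    """
    scores = {
        stop: sum(ATTRACTION[land] * area for land, area in areas.items())
        for stop, areas in areas_by_stop.items()
    }
    total = sum(scores.values())
    return {stop: score / total for stop, score in scores.items()}

# Hypothetical areas (arbitrary units) within the study radius of two stops.
areas = {
    10: {"residential": 2.0, "commercial_service": 1.0},
    11: {"industrial": 1.0, "green_plaza": 1.0},
}
probs = land_use_probs(areas)
print(probs)
```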
The method based on group history records determines the alighting probability of a possible alighting stop as follows:
a) Cluster the bus IC card data with identified boarding stops, and take the records in the same cluster whose alighting stops were identified by the travel-chain method as the group history records used to determine the alighting stops still to be identified;
b) Construct clustering indexes of two types, specifically:
The first type consists of relevant fields in the bus IC card data with identified boarding stops, such as the card number and card type. These fields record the data generated by each card swipe, ensure that each IC card record corresponds to exactly one swipe, and ensure that the smallest unit of clustering is a single record.
The second type consists of several indexes constructed from the first-type indexes and actual conditions, used to mine the similarity between different IC card records. For example, the boarding-stop type index is obtained by clustering the stop numbers and the total boarding passenger flow of each stop into three classes with the K-Means algorithm, giving three levels of passenger-flow hub stops; it helps records that board at hubs of the same level fall into the same cluster. The passenger loyalty index is obtained by clustering the number of card swipes and the corresponding number of cards into three classes with the K-Means algorithm; it helps records with similar behavioral habits fall into the same cluster.
c) Select several clustering indexes and normalize them: scale each index by min-max normalization so that its values lie between a given minimum and a given maximum, bringing the feature values of every clustering index to unit scale;
d) Cluster with the K-Means algorithm, determining the optimal number of clusters $C_G$ with the elbow rule, to obtain the card-swiping travel patterns;
e) Suppose the data of the $b$-th trip with a broken travel chain made by the $m$-th passenger on day $d$ belong to a given cluster; the records in that cluster whose alighting stops were determined by the travel-chain method then form the group history data set. From the group history data set, determine the frequency of trips that board at the $j_1$-th stop and alight at the $j_2$-th possible alighting stop; the probability that this trip alights at the $j_2$-th possible alighting stop is then computed from this frequency.
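Steps c) to e) can be sketched end to end in plain Python. This is an illustrative sketch only: the K-Means routine is a basic Lloyd iteration written for this example rather than a library call, the two indexes and all records are hypothetical, and the group probability is assumed to be the alighting frequency normalized over the candidate stops.

```python
import random
from collections import Counter

def min_max_scale(column, lo=0.0, hi=1.0):
    """Step c): scale one clustering index to [lo, hi]."""
    vmin, vmax = min(column), max(column)
    if vmax == vmin:
        return [lo] * len(column)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in column]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Step d): basic Lloyd's K-Means; returns (centers, labels)."""
    centers = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def average_distortion(points, centers):
    """Mean squared distance to the nearest center (the elbow-rule curve)."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)

def group_history_probs(records, cluster, boarding_stop, candidate_stops):
    """Step e): records are (cluster, boarding_stop, alighting_stop) tuples whose
    alighting stop was identified by the travel-chain method (hypothetical format)."""
    freq = Counter(a for c, b, a in records if c == cluster and b == boarding_stop)
    total = sum(freq[s] for s in candidate_stops)
    if total == 0:
        return {s: 0.0 for s in candidate_stops}
    return {s: freq[s] / total for s in candidate_stops}

# Two hypothetical clustering indexes per record, scaled and then clustered.
idx1 = min_max_scale([1, 2, 3, 9, 10, 11])
idx2 = min_max_scale([5, 6, 7, 40, 42, 41])
points = list(zip(idx1, idx2))
distortion = {k: average_distortion(points, kmeans(points, k)[0]) for k in (1, 2, 3)}
print(distortion)  # the elbow is where the curve stops dropping sharply

records = [(0, 8, 14), (0, 8, 14), (0, 8, 16), (1, 8, 16), (0, 7, 14)]
probs = group_history_probs(records, cluster=0, boarding_stop=8, candidate_stops=[14, 15, 16])
print(probs)
```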
Step 2) is specifically as follows:
2.1) Build the model: label each possible alighting stop identified in step 1) as 0 or 1, where a stop labelled 1 is the correct alighting stop and a stop labelled 0 is an incorrect alighting stop; the correct and incorrect alighting stops are used as input to the Logistic regression model of the second layer of the Stacking framework. The input values of this Logistic regression model lie in [0, 1], so no normalization is needed.
For the $b$-th trip with a broken travel chain made by the $m$-th passenger on day $d$, the Logistic regression model outputs the probability of alighting at the $j_2$-th possible alighting stop, namely the logistic (sigmoid) function of the weighted sum of the inputs plus the bias,
where the input vector consists of $P_{m,d,b}(j_1, j_2)$, which is one or more of the five first-layer alighting probabilities; the weight vector consists of one or more of $w_1$, $w_2$, $w_3$, $w_4$, $w_5$, the weights of the respective first-layer probabilities; and $w_0$ is the bias;
in the present embodiment of the present invention,
2.2) Use the bus IC card data whose alighting stops were identified by the travel-chain method as the training set and test set to learn the model. In practice a trip often has many possible alighting stops, so the amount of data with label 0 is far greater than the amount with label 1. To handle this data imbalance, i.e. if the number of incorrect alighting stops is greater than the number of correct alighting stops, the following steps are performed:
2.2.1) Randomly undersample the data of incorrect alighting stops (data with label 0) and merge them with the original data of correct alighting stops (original data with label 1);
2.2.2) Oversample the data of correct alighting stops (original data with label 1) using the SMOTE algorithm and merge them with the original data of incorrect alighting stops (original data with label 0);
2.2.3) Merge the data obtained from steps 2.2.1) and 2.2.2), select 90% of the data as the training set of the Logistic regression model, and use the remaining 10% as its test set.
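The resampling in steps 2.2.1) to 2.2.3) can be sketched as follows. The SMOTE routine here is a simplified re-implementation of the basic idea (interpolating between a minority row and one of its nearest minority neighbours), not the reference algorithm or any library API; the sample sizes, the number of neighbours, and the feature rows are hypothetical.

```python
import random

def random_undersample(rows, n, rng):
    """2.2.1) Keep n randomly chosen majority-class (label 0) rows."""
    return rng.sample(rows, n)

def smote(rows, n_new, k, rng):
    """2.2.2) Create n_new synthetic minority rows by interpolating between a
    row and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        others = [r for r in rows if r is not base]
        neighbours = sorted(
            others, key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r))
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and mate
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic

rng = random.Random(42)
# Hypothetical feature rows: five first-layer probabilities per candidate stop.
correct = [tuple(rng.random() for _ in range(5)) for _ in range(20)]      # label 1
incorrect = [tuple(rng.random() for _ in range(5)) for _ in range(200)]   # label 0

# 2.2.1) undersampled label-0 rows merged with the original label-1 rows
part_a = [(x, 0) for x in random_undersample(incorrect, 60, rng)] + [(x, 1) for x in correct]
# 2.2.2) SMOTE-generated label-1 rows merged with the original label-0 rows
part_b = [(x, 1) for x in smote(correct, 100, k=5, rng=rng)] + [(x, 0) for x in incorrect]
# 2.2.3) merge both parts, shuffle, and split 90% / 10%
data = part_a + part_b
rng.shuffle(data)
cut = len(data) * 9 // 10
train, test = data[:cut], data[cut:]
print(len(train), len(test))
```

Because every synthetic row is a convex combination of two real rows, the synthetic features stay inside [0, 1] and remain valid inputs for the second layer.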
2.3) Estimate the model parameters by maximum likelihood estimation, which turns the problem into the optimization problem of maximizing a criterion function, and use the L-BFGS (Limited-memory BFGS) algorithm, which is suited to large-scale data, to determine the parameter values; the probability of alighting at the $j_2$-th possible alighting stop is then obtained by evaluating the model with the estimated parameters,
where the parameters used are the maximum likelihood estimates of the weights;
2.4) The alighting stop of the $b$-th trip with a broken travel chain made by the $m$-th passenger on day $d$ is the $j_2$-th possible alighting stop with the highest alighting probability among the possible alighting stops.
Further, after the number of the alighting stop is determined, its name, longitude, and latitude can be determined from the static bus route and stop information.
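Scoring and selection in steps 2.1) and 2.4) can be sketched as below, assuming the weights have already been estimated. The weight values, bias, stop numbers, and first-layer probabilities are hypothetical placeholders; parameter estimation by L-BFGS is not shown.

```python
import math

def alighting_probability(x, w, w0):
    """Sigmoid of the weighted sum of first-layer probabilities plus the bias."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def pick_alighting_stop(first_layer, w, w0):
    """first_layer: {stop: [five first-layer probabilities]}.
    Returns the stop with the highest second-layer probability and all scores."""
    scores = {stop: alighting_probability(x, w, w0) for stop, x in first_layer.items()}
    return max(scores, key=scores.get), scores

# Hypothetical estimated parameters (not values from the patent).
w = [1.5, 0.8, 0.4, 0.3, 2.0]
w0 = -2.0

# Hypothetical first-layer outputs for three candidate stops.
first_layer = {
    12: [0.10, 0.20, 0.30, 0.25, 0.15],
    13: [0.60, 0.30, 0.30, 0.35, 0.55],
    14: [0.30, 0.50, 0.40, 0.40, 0.30],
}
best, scores = pick_alighting_stop(first_layer, w, w0)
print(best, scores)
```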
Examples
1. Introduction to experimental objects and data sets
Bus Rapid Transit (BRT) is a new type of public passenger transport system that falls between rapid rail transit and conventional buses. The Xiamen BRT collects the fare-gate entry and exit records of passenger IC cards, so the complete boarding and alighting stops of each IC card record can be determined. In addition, the physically separated dedicated BRT lanes in Xiamen allow the several BRT lines to form a small bus network.
The research problem of this example is how to identify the alighting stop of a bus IC card passenger when the travel chain is broken. The problem is described formally using the three bus trips made by one passenger on one day, shown in fig. 3.
First trip: the passenger boards at stop $k_1$ in the up direction of line A, so the set of possible alighting stops is $\{k_2, k_3, k_4, k_5, k_6\}$. From the distance between the boarding stop of the second trip and each possible alighting stop, the bus GPS arrival times, and the IC card swipe time of the second trip, the alighting stop of the first trip can be determined by the travel-chain method.
Second trip: the passenger boards at stop $k_5$ in the down direction of line B, so the set of possible alighting stops is $\{k_2, k_3, k_4\}$. Because the distance between the boarding stop of the third trip and each possible alighting stop exceeds the threshold, the travel chain is broken, and the alighting stop of the second trip cannot be determined by the travel-chain method.
Third trip: the passenger boards at stop $k_1$ in the up direction of line C, so the set of possible alighting stops is $\{k_2, k_3, k_4\}$. Because this is the last trip of the day, the boarding stop of the first trip of the day is assumed to be the boarding stop of the next trip, and the travel-chain method is applied to check whether the alighting stop can be determined.
In this example, the distance between the boarding stop of the first trip and each possible alighting stop exceeds the threshold, so the travel chain is broken and the alighting stop of the third trip cannot be determined by the travel-chain method.
In summary, of the three bus trips made by the passenger on that day, as shown in fig. 3: the first trip has a complete travel chain and is not a study object; the second and third trips have broken travel chains, and identifying the alighting stops of these two trips is the research problem of this example.
In this example, the IC card data from November 2018 in Xiamen, Fujian Province, whose entry fare-gate stops along BRT Express Line 1 have been identified, are selected as the study object, and the IC card data of the remaining BRT lines with identified entry fare gates in the same period are used to assist in identifying alighting stops by the travel-chain method, as shown in fig. 4. Within the study period there are 3673184 IC card records with an identified boarding (entry) stop at BRT Line 1 stops, and the card types are student cards, senior cards, ordinary cards, and special cards. Within the study period, the morning peak in Xiamen is 7:00:00-9:00:00 and the evening peak is 17:00:00-19:00:00; whether it rained on each day of the study period is determined from the weather conditions published by the China Weather Network.
Xiamen BRT Express Line 1 runs in the up and down directions; both directions pass through the same 27 stops in opposite order, spanning three administrative districts: Siming District, Huli District, and Jimei District. The types and areas of urban construction land within 800 meters of each stop are shown in fig. 5. The magnitude of the land-use attraction coefficient around a stop depends on the scale of the city studied, and cities of similar scale have similar values; the invention therefore determines the attraction coefficient of each type of urban construction land from the city scale of Xiamen and related research, as shown in Table 1.
Table 1: site-surrounding land use property attraction coefficient
| Land use Properties | Coefficient of attraction |
| Residence land | 1 |
| Commercial service facility land | 1.2 |
| Public management and public service land | 1.1 |
| Industrial land | 1 |
| Logistics and warehousing land | 0.6 |
| Public utility land | 0.8 |
| Greenbelt and plaza land | 0.7 |
| Traffic facility land | 1.3 |
Meanwhile, the number of BRT lines passing through each site can be determined according to static line site information of the BRT as shown in fig. 6.
2. Evaluation method and index
The evaluation consists of two parts: (1) verification of the Logistic regression model in the second layer of the two-layer Stacking framework, learned from bus IC card data whose alighting stops were identified by the travel-chain method; and (2) verification of the identification of alighting stops for bus IC card passengers when the travel chain is broken.
(1) Verification of Logistic regression models
Verification method: the training set and test set of the Logistic regression model come from the bus IC card data whose alighting stops were identified by the travel-chain method.
Verification index: the F1 score is used, and the closer its value is to 1, the better the learned Logistic regression model. Let $TP$ be the number of records with actual label 1 that are predicted as label 1, $FN$ the number of records with actual label 1 that are predicted as label 0, and $FP$ the number of records with actual label 0 that are predicted as label 1; the F1 score is then $F_1 = \frac{2\,TP}{2\,TP + FN + FP}$.
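The F1 score can be computed directly from the three counts described above. The sketch below uses the standard definition of F1; the label lists are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FN + FP), with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    denom = 2 * tp + fn + fp
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels: 3 true positives, 1 false negative, 1 false positive.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(score)  # 0.75
```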
(2) Verification of alighting stop identification when the travel chain is broken
Verification method: for IC card data whose alighting stop is known, the alighting stop is treated as missing and identified as it would be for a broken travel chain; the identified alighting stop is then compared with the actual alighting stop.
Verification index: the identification rate and the accuracy rate are used, and the closer both values are to 100%, the better the method. Let there be $N_{un}$ records with a broken travel chain whose alighting stops need to be identified, of which the alighting stop can be determined for some; the identification rate is the ratio of the number of records with a determined alighting stop to $N_{un}$.
If, among the records with a determined alighting stop, some are identified with the correct alighting stop, the accuracy rate is the ratio of the number of correctly identified records to the number of records with a determined alighting stop.
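The two indexes can be sketched as simple ratios. The record identifiers, the use of `None` for an undetermined stop, and the sample data are hypothetical.

```python
def identification_and_accuracy(predicted, actual):
    """predicted: {record_id: stop or None}; actual: {record_id: stop}.

    Identification rate = records with a determined stop / all records.
    Accuracy rate = correctly identified records / records with a determined stop.
    """
    identified = {r: s for r, s in predicted.items() if s is not None}
    correct = sum(1 for r, s in identified.items() if actual[r] == s)
    ident_rate = len(identified) / len(predicted)
    accuracy = correct / len(identified) if identified else 0.0
    return ident_rate, accuracy

# Hypothetical results for five broken-chain records.
predicted = {"a": 14, "b": None, "c": 16, "d": 9, "e": 12}
actual = {"a": 14, "b": 11, "c": 15, "d": 9, "e": 12}
ident_rate, accuracy = identification_and_accuracy(predicted, actual)
print(ident_rate, accuracy)  # 0.8 0.75
```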
3. Setting of experimental parameters
The parameter values involved in this example are determined as follows:
(1) Distance threshold setting based on travel chain method
Based on the actual spacing of BRT stops in Xiamen, this example sets the radius of the land-use study area around each BRT stop to $Dis_{land\text{-}use} = 800$ meters and the distance threshold of the travel-chain method to 2000 meters.
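The distance criterion of the travel-chain method can be sketched as follows. Only the 2000-meter threshold comes from the example; the use of the haversine great-circle distance, the coordinates, and the function names are assumptions, and the arrival-time and swipe-time checks mentioned in the example are not modeled.

```python
import math

CHAIN_THRESHOLD_M = 2000.0  # distance threshold stated in this example

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (assumed distance measure)."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def chain_is_broken(candidates, next_boarding, threshold_m=CHAIN_THRESHOLD_M):
    """True when every candidate alighting stop is farther than the threshold
    from the boarding stop of the next trip."""
    return all(haversine_m(*c, *next_boarding) > threshold_m for c in candidates)

# Hypothetical (latitude, longitude) pairs.
candidates = [(24.480, 118.090), (24.490, 118.095)]
near_next = (24.481, 118.091)   # roughly 150 m from the first candidate
far_next = (24.600, 118.100)    # more than 10 km from both candidates
print(chain_is_broken(candidates, near_next), chain_is_broken(candidates, far_next))
```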
(2) Penalty coefficient determination for Logistic regression model
In the second layer of the two-layer Stacking framework, the optimal penalty coefficient of the Logistic regression model was determined to be 100 through repeated experiments, so that the model performs at its best.
(3) Parameter setting for contrast method
The KNN-based method, the decision-tree-based method, the random-forest-based method, and the method based on passenger high-frequency stops and downstream stop attraction are selected for comparison.
In the KNN-based, decision-tree-based, and random-forest-based methods, existing research uses POI data as a substitute for the land use around stops; since the land use around stops is available in the data used here, no POI substitution is needed. Through repeated experiments, the number of nearest neighbors suited to this data set in the KNN-based method is set to 1000, the number of trees in the random-forest-based method is set to 2000, and the Gini coefficient is used as the splitting criterion in the decision-tree-based method.
Both the comparison method based on passenger high-frequency stops and downstream stop attraction and the downstream-stop-attraction method in the first layer of the Stacking framework need the passenger flow at each stop for the bus run in order to determine the alighting stop when the travel chain is broken. This example, however, uses only BRT data recorded by the entry and exit fare gates, so the run a passenger took is unknown. The passenger flow at each stop during the hour containing the passenger's entry time is therefore used in place of the per-run flow when computing probabilities to determine the alighting stop.
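The hourly-flow substitution can be sketched as below. The record format, timestamps, and stop numbers are hypothetical, and the probability is assumed to be the hourly boarding count at a stop normalized over the candidate stops.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def hourly_boardings(swipes):
    """swipes: (stop, timestamp string) entry records (hypothetical format).
    Returns a Counter keyed by (stop, date, hour)."""
    flow = Counter()
    for stop, ts in swipes:
        t = datetime.strptime(ts, FMT)
        flow[(stop, t.date(), t.hour)] += 1
    return flow

def downstream_attraction_probs(flow, entry_ts, candidate_stops):
    """Use the flow in the hour containing the passenger's entry time as a
    stand-in for the flow of the (unknown) run the passenger took."""
    t = datetime.strptime(entry_ts, FMT)
    counts = {s: flow[(s, t.date(), t.hour)] for s in candidate_stops}
    total = sum(counts.values())
    if total == 0:
        return {s: 0.0 for s in candidate_stops}
    return {s: c / total for s, c in counts.items()}

# Hypothetical entry records; the last one falls in the next hour.
swipes = [
    (9, "2018-11-05 07:05:10"),
    (9, "2018-11-05 07:40:00"),
    (10, "2018-11-05 07:59:59"),
    (9, "2018-11-05 08:01:00"),
]
flow = hourly_boardings(swipes)
probs = downstream_attraction_probs(flow, "2018-11-05 07:15:00", [9, 10, 11])
print(probs)
```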
4. Example results
With the experimental parameters set as above, of the 3673184 IC card records with identified boarding stops in the study period, the alighting stops of 2425101 records can be identified by the travel-chain method (with an accuracy rate of 80.96%). The remaining 1248083 records have broken travel chains; their alighting stops are identified with the KNN-based method, the decision-tree-based method, the random-forest-based method, the method based on passenger high-frequency stops and downstream stop attraction, and the method of the invention.
The example results are presented below for the method based on group history records, for the Logistic regression model in the second layer of the two-layer Stacking framework, and for the identification of IC card passengers' alighting stops when the travel chain is broken.
(1) Method based on group history record
The frequency of swiping the 3673184 pieces of IC card data of the boarding site identified in the study period was counted as shown in fig. 7. It can be found that the number of cards decreases as the frequency of card swiping by passengers increases, and that the frequency of card swiping by more than 80% of IC cards is 8 times or less.
When the passenger loyalty index based on the group history method is constructed, the card swiping times and the corresponding card quantity of the research data of 2018 in Xiamen city are clustered into three types based on a K-Means algorithm, and then a result is obtained: when the number of card swipes is 1 or 2 per month, the passenger occasionally rides BRT, and the loyalty of the passenger=1; when the number of card swipes is 3 to 8 in one month, BRT is a traffic mode selectable by the passenger, and the loyalty of the passenger=2; when the number of swipes is greater than 8 for one month, the passenger is a faithful user for BRT travel, and the loyalty of the passenger=3. At this time, the IC card data amount ratio of various card types under various loyalty can be obtained as shown in fig. 8.
Eleven clustering indexes are constructed from the actual data of the Xiamen study period, as shown in Table 2, and are min-max normalized. Clustering with the K-Means algorithm then gives the variation of the average distortion with the number of clusters, as shown in FIG. 9.
Table 2: Indexes for clustering IC card data in the method based on group history records
According to FIG. 9 and the elbow rule, clustering is optimal with 2 clusters. The 3673184 IC card records whose boarding station was identified in the study period are therefore divided into cluster R1 and cluster R2; the number of records in each cluster is shown in Table 3.
Table 3: Number of records in each cluster
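The normalization and elbow steps described above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, the seeding rule (farthest-point) is chosen only to make the example deterministic, and the "average distortion" is assumed here to be the mean Euclidean distance from each record to its nearest cluster center.

```python
import math

def min_max_normalize(rows):
    """Scale every column of `rows` to [0, 1]; constant columns map to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def k_means(points, k, iterations=50):
    """Lloyd iterations with deterministic farthest-point seeding (an assumption)."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next seed: the point farthest from all centers chosen so far.
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(farthest))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def average_distortion(points, centers):
    """Mean distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)
```

Computing `average_distortion` for k = 1, 2, 3, ... and looking for the value of k after which the curve flattens reproduces the kind of elbow plot referred to in FIG. 9.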
Consider two trips with broken travel chains that both board at the 8th station in the up direction of Xiamen BRT Line 1 but belong to cluster R1 and cluster R2 respectively. With the method based on group history records, the alighting probability at each possible alighting station can be determined for the two trips from the alighting stations that the travel chain method identified for records in the same cluster, as shown in FIG. 10. The trip belonging to cluster R1 has its highest alighting probability at the 14th station, and the trip belonging to cluster R2 has its highest alighting probability at the 16th station.
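The idea of the example above can be sketched as a frequency count. This is an illustration under a stated assumption: the alighting probability is taken to be the relative frequency of each downstream station among same-cluster records with the same boarding station, which may differ from the exact formula of the invention. The function name and the tuple layout are hypothetical, and the sample history is invented toy data, not study data.

```python
from collections import Counter

def alighting_probabilities(history, cluster, boarding_station):
    """Relative frequency of each downstream alighting station.

    `history` holds (cluster, boarding_station, alighting_station) tuples for
    records whose alighting station was identified by the travel chain method.
    """
    counts = Counter(
        alight
        for c, board, alight in history
        if c == cluster and board == boarding_station and alight > board
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {station: n / total for station, n in sorted(counts.items())}

# Toy history (invented for illustration only).
history = [
    ("R1", 8, 14), ("R1", 8, 14), ("R1", 8, 16),
    ("R2", 8, 16), ("R2", 8, 16), ("R2", 8, 12),
    ("R1", 3, 9),
]
probs_r1 = alighting_probabilities(history, "R1", 8)
probs_r2 = alighting_probabilities(history, "R2", 8)
```

With this toy history, the most probable station differs between the two clusters even though both trips board at station 8, which is the behavior the group history method relies on.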
(2) Logistic regression model in the second layer of the two-layer Stacking framework
The model is trained and tested on the 2425101 records whose alighting station was identified by the travel chain method. After each IC card record is associated with every possible alighting station, 40113752 records are obtained, of which 2425101 correspond to the correct alighting station and have a label value of 1, while the remaining 37688651 have a label value of 0. Because the data are imbalanced, these records are resampled, as shown in Table 4.
Table 4: number of IC card data before and after sampling
As can be seen from Table 4, the amount of label-0 data after sampling is still large, so as to avoid under-fitting of the model caused by too little data, but the ratio of label-0 data to label-1 data has fallen from 15.54:1 before sampling to 1.82:1, a clear improvement.
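The candidate expansion and the class ratio discussed above can be sketched as follows. The function names and record fields are hypothetical, and the sketch assumes that the candidate alighting stations are all stations after the boarding station up to the last station of the route. It does not reproduce the resampling step, whose scheme is not detailed here.

```python
def expand_candidates(trip_id, boarding_index, alighting_index, station_count):
    """One row per possible alighting station; label 1 marks the true one."""
    return [
        {"trip": trip_id, "candidate": j2, "label": int(j2 == alighting_index)}
        for j2 in range(boarding_index + 1, station_count + 1)
    ]

def negative_to_positive_ratio(rows):
    """Ratio of label-0 rows to label-1 rows."""
    positives = sum(r["label"] for r in rows)
    negatives = len(rows) - positives
    return negatives / positives

# A trip boarding at station 8 and alighting at station 14 on a 20-station route.
rows = expand_candidates("trip-1", 8, 14, 20)

# Ratio before sampling, from the record counts reported in the text.
reported_ratio = round(37688651 / 2425101, 2)
```

Each trip contributes exactly one label-1 row and many label-0 rows, which is why the expanded data set is imbalanced before sampling.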
Of the 75178131 sampled records, 90% were randomly selected as the training set for the Logistic regression model, and the remaining 10% were used as the test set. With a penalty coefficient of 100, the F1 score on the test set is 0.67, and the parameters of the resulting Logistic regression model are as follows:
(3) Identification of the alighting station of IC card passengers when the travel chain breaks
The results of identifying the alighting station of IC card records with broken travel chains by the method based on the two-layer Stacking framework are shown in Table 5.
Table 5: results of different methods for identifying a get-off station
For different days of the week and different time periods, the accuracy of alighting station identification by the method based on the two-layer Stacking framework is studied. Records are grouped by day of the week and by hour of the passenger transaction time, and the accuracy of each group is counted, as shown in FIG. 11.
For different card types and different time periods, the accuracy of alighting station identification by the method based on the two-layer Stacking framework is studied. Records are grouped by card type and by hour of the passenger transaction time, and the accuracy of each group is counted, as shown in FIG. 12.
For passengers with different card types and different loyalty levels, the accuracy of alighting station identification by the method based on the two-layer Stacking framework is studied. Records are grouped by card type and by loyalty level, and the accuracy of each group is counted, as shown in FIG. 13.
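The three groupings above share one pattern: assign each record to a group, then divide the number of correctly identified records by the group size. A minimal sketch follows; the field names (`time`, `card_type`, `loyalty`, `predicted`, `actual`) and helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def accuracy_by_group(records, key):
    """Share of records in each group whose predicted station equals the actual one."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        hits[group] += int(record["predicted"] == record["actual"])
    return {group: hits[group] / totals[group] for group in totals}

def weekday_and_hour(record):
    """Group key in the style of FIG. 11: (day of week with Monday = 0, hour)."""
    return (record["time"].weekday(), record["time"].hour)

def card_type_and_hour(record):
    """Group key in the style of FIG. 12: (card type, hour)."""
    return (record["card_type"], record["time"].hour)

def card_type_and_loyalty(record):
    """Group key in the style of FIG. 13: (card type, loyalty level)."""
    return (record["card_type"], record["loyalty"])
```

Passing a different key function to `accuracy_by_group` is enough to switch between the three breakdowns.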
5. Analysis of results
(1) As shown in FIG. 7, more than 80% of the IC cards are swiped only a few times. In the method based on personal history records, the personal historical data set, which consists of the individual's records whose alighting station was identified by the travel chain method, is therefore small. The method based on group history records proposed by the present invention is thus necessary to remedy this shortcoming.
As shown in FIG. 11, the records whose alighting station was identified by the travel chain method, used as the historical data set, behave differently in different clusters. Clustering at the data level effectively brings similar records together, which shows that the group history record method, which clusters with the IC card record as the minimum unit, is effective.
(2) As can be seen from Table 5, the identification rate of the method based on the two-layer Stacking framework proposed by the present invention is 100.00%, the same as that of the KNN-based, decision-tree-based and random-forest-based methods. The alighting station can therefore be identified for all IC card records with broken travel chains, and the identification rate is slightly higher than that of the method based on passenger high-frequency stations and downstream station attraction.
(3) As can be seen from Table 5, the method based on the two-layer Stacking framework and the method based on passenger high-frequency stations and downstream station attraction, both of which combine aggregate and disaggregate models, are far more accurate than the KNN-based, decision-tree-based and random-forest-based methods, which each use only a single model. Combining aggregate and disaggregate models therefore gives better results than using a single model. Of the two combined methods, the method of the present invention is the more accurate, which shows that using the Logistic regression model in the second layer to determine the weights of the first-layer methods for different data sets is effective: the resulting model parameters fit the data set better and generalize better.
(4) As shown in FIG. 11, the accuracy is highest during the weekday morning peak (7:00:00-9:00:00) in November 2018, and is higher than in non-working-day periods. Meanwhile, as shown in FIG. 8 and FIG. 12, ordinary cards and student cards, which together account for more than 93.5% of the IC card records, have a higher accuracy in the morning peak than in other periods.
The weekday morning peak is therefore the period in which passenger travel is most regular, which is consistent with the fact that a large number of commuters and students travel from a fixed residence to a fixed workplace or school in the morning peak. Analysis of FIG. 8 and FIG. 13 shows that, for every card type, the more often the card is swiped, the higher the accuracy of alighting station identification; that is, the higher the passenger's loyalty to the BRT, the more regular the travel behavior from origin to destination.
The method of the present invention differs considerably from existing typical alighting station identification methods. A comprehensive comparison in terms of method system, data volume, application range, identification rate and other aspects is given in Table 6.
Table 6: Comparison between the present invention and existing typical alighting station identification methods
The above examples are only for illustrating the present invention and are not to be construed as limiting the invention. Variations, modifications, etc. of the above-described embodiments are intended to fall within the scope of the claims of the present invention, as long as they are in accordance with the technical spirit of the present invention.