CN103744861A

Movatterモバイル変換

Info

Publication number: CN103744861A
Application number: CN201310687107.3A
Authority: CN
Inventors: 黄鑫; 罗军
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Guangdong Gaohang Intellectual Property Operation Co ltd; Jiangsu Aerospace Dawei Technology Co Ltd
Priority date: 2013-12-12
Filing date: 2013-12-12
Publication date: 2014-04-23
Anticipated expiration: 2033-12-12
Also published as: CN103744861B

Abstract

Translated fromChinese

本发明适用于数据处理技术领域，提供了一种轨迹数据中的频繁子轨迹查找方法及装置，包括：分离轨迹数据中的空间信息和时间信息；将所述空间信息编码成第一类字符，每个所述第一类字符用于表示一个地理位置；将所述时间信息编码成第二类字符，每个所述第二类字符用于表示一段间隔时间；根据编码成所述第一类字符的所述空间信息和编码成所述第二类字符的所述时间信息，建立广义后缀树；查找所述广义后缀树中的频繁子字符串；将查找出的所述频繁子字符串转换成频繁子轨迹。本发明通过使用较为高效的字符串算法来处理较为复杂的多维数值数据，使得整个频繁子轨迹查找过程的计算复杂度大大降低。

The present invention is applicable to the technical field of data processing, and provides a method and device for searching frequent sub-trajectories in trajectory data, including: separating spatial information and time information in trajectory data; encoding the spatial information into first-type characters, Each of the first-type characters is used to represent a geographic location; the time information is encoded into a second-type character, and each of the second-type characters is used to represent a period of time interval; according to encoding into the first type The space information of characters and the time information encoded into the second type of characters are used to establish a generalized suffix tree; search for frequent substrings in the generalized suffix tree; convert the found frequent substrings into frequent sub-trajectories. The present invention uses a relatively efficient string algorithm to process relatively complex multi-dimensional numerical data, so that the calculation complexity of the entire frequent sub-track search process is greatly reduced.

Description

Translated fromChinese

一种轨迹数据中的频繁子轨迹查找方法及装置A method and device for searching frequent sub-trajectories in trajectory data

技术领域technical field

本发明属于数据处理技术领域，尤其涉及一种轨迹数据中的频繁子轨迹查找方法及装置。The invention belongs to the technical field of data processing, in particular to a method and device for searching frequent sub-trajectories in trajectory data.

背景技术Background technique

轨迹数据就是时空环境下，通过对一个或者多个移动对象运动过程的采样所获得的数据信息，包括采样点位置、采样时间、速度等，这些采样点数据信息根据采样先后顺序构成了轨迹数据。常见的轨迹数据包括车辆行驶轨迹、移动互联网用户的旅行轨迹、移动互联网用户的签到轨迹，等等，海量的轨迹数据里蕴含着丰富的信息，其频繁子轨迹可以表现大多数人的行为模式及习惯，或者表现气候的变化规律等。Trajectory data is the data information obtained by sampling the motion process of one or more moving objects in the space-time environment, including the position of sampling points, sampling time, speed, etc., and the data information of these sampling points constitutes trajectory data according to the order of sampling. Common trajectory data include vehicle driving trajectory, travel trajectory of mobile Internet users, check-in trajectory of mobile Internet users, etc. Massive trajectory data contains rich information, and its frequent sub-trajectories can represent most people's behavior patterns and Habits, or the law of climate change.

由于轨迹数据是数值数据，不能直接套用目前已相当成熟的字符串频繁子串的查找算法来查找轨迹数据中的频繁子轨迹，因此，现有技术中大多直接对轨迹数据进行划分并聚类，将长度为O(n)的轨迹划分为O(n²)个子轨迹，再对这些子轨迹进行聚类分析来发现频繁子轨迹，整个过程计算复杂度高，运算时间长。Since the trajectory data is numerical data, it is not possible to directly apply the currently quite mature search algorithm for frequent substrings of character strings to find frequent sub-trajectories in the trajectory data. Therefore, most of the existing technologies directly divide and cluster the trajectory data. Divide the trajectory with length O(n) into O(n² ) sub-trajectories, and then perform cluster analysis on these sub-trajectories to find frequent sub-trajectories. The whole process has high computational complexity and long operation time.

发明内容Contents of the invention

本发明实施例的目的在于提供一种轨迹数据中的频繁子轨迹查找方法，旨在解决现有的在轨迹数据中查找频繁子轨迹的算法计算复杂度高的问题。The purpose of the embodiments of the present invention is to provide a method for searching frequent sub-trajectories in trajectory data, aiming at solving the problem of high computational complexity of existing algorithms for searching frequent sub-trajectories in trajectory data.

本发明实施例是这样实现的，一种轨迹数据中的频繁子轨迹查找方法，包括：The embodiment of the present invention is achieved in this way, a method for searching frequent sub-trajectories in trajectory data, comprising:

分离轨迹数据中的空间信息和时间信息；Separate spatial and temporal information in trajectory data;

将所述空间信息编码成第一类字符，每个所述第一类字符用于表示一个地理位置；Encoding the spatial information into first-type characters, each of the first-type characters is used to represent a geographic location;

将所述时间信息编码成第二类字符，每个所述第二类字符用于表示一段间隔时间；Encoding the time information into a second type of characters, each of the second type of characters is used to represent an interval of time;

根据编码成所述第一类字符的所述空间信息和编码成所述第二类字符的所述时间信息，建立广义后缀树；building a generalized suffix tree according to the spatial information encoded into the first type of characters and the time information encoded into the second type of characters;

查找所述广义后缀树中的频繁子字符串；Find frequent substrings in the generalized suffix tree;

将查找出的所述频繁子字符串转换成频繁子轨迹。The found frequent substrings are converted into frequent sub-trajectories.

本发明实施例的另一目的在于提供一种轨迹数据中的频繁子轨迹查找装置，包括：Another object of the embodiments of the present invention is to provide an apparatus for searching frequent sub-trajectories in trajectory data, including:

分离单元，用于分离轨迹数据中的空间信息和时间信息；A separation unit for separating spatial information and temporal information in the trajectory data;

第一编码单元，用于将所述空间信息编码成第一类字符，每个所述第一类字符用于表示一个地理位置；a first encoding unit, configured to encode the spatial information into first-type characters, each of the first-type characters is used to represent a geographic location;

第二编码单元，用于将所述时间信息编码成第二类字符，每个所述第二类字符用于表示一段间隔时间；a second encoding unit, configured to encode the time information into a second type of characters, each of the second type of characters is used to represent an interval of time;

建立单元，用于根据编码成所述第一类字符的所述空间信息和编码成所述第二类字符的所述时间信息，建立广义后缀树；A building unit, configured to build a generalized suffix tree according to the spatial information encoded into the first type of characters and the time information encoded into the second type of characters;

查找单元，用于查找所述广义后缀树中的频繁子字符串；A search unit, configured to search for frequent substrings in the generalized suffix tree;

转换单元，用于将查找出的所述频繁子字符串转换成频繁子轨迹。A conversion unit, configured to convert the found frequent substrings into frequent sub-trajectories.

本发明实施例结合了数据挖掘技术、后缀树算法以及非精确匹配，从而实现了较优的轨迹数据中的频繁子轨迹的查找，通过使用较为高效的字符串算法来处理较为复杂的多维数值数据，使得整个频繁子轨迹查找过程的计算复杂度大大降低。The embodiment of the present invention combines data mining technology, suffix tree algorithm and non-exact matching, thereby realizing the search of frequent sub-trajectories in better trajectory data, and processing more complex multi-dimensional numerical data by using a more efficient string algorithm , so that the computational complexity of the entire frequent sub-trajectory search process is greatly reduced.

附图说明Description of drawings

图1是本发明实施例提供的轨迹数据中的频繁子轨迹查找方法的实现流程图；Fig. 1 is the implementation flowchart of the frequent sub-track search method in the track data provided by the embodiment of the present invention;

图2是本发明实施例提供的轨迹数据中的频繁子轨迹查找方法S102的具体实现流程图；FIG. 2 is a specific implementation flowchart of the frequent sub-trajectory search method S102 in the trajectory data provided by the embodiment of the present invention;

图3是本发明实施例提供的轨迹数据中的频繁子轨迹查找方法对空间信息进行聚类的示意图；3 is a schematic diagram of clustering spatial information by a frequent sub-trajectory search method in trajectory data provided by an embodiment of the present invention;

图4是本发明实施例提供的轨迹数据中的频繁子轨迹查找方法S103的具体实现流程图；Fig. 4 is a specific implementation flowchart of the frequent sub-trajectory search method S103 in the trajectory data provided by the embodiment of the present invention;

图5是本发明实施例提供的轨迹数据中的频繁子轨迹查找方法建立的广义后缀树的示意图；Fig. 5 is a schematic diagram of the generalized suffix tree established by the frequent sub-trajectory search method in the trajectory data provided by the embodiment of the present invention;

图6是本发明实施例提供的轨迹数据中的频繁子轨迹查找装置的结构框图。Fig. 6 is a structural block diagram of an apparatus for searching frequent sub-trajectories in trajectory data provided by an embodiment of the present invention.

具体实施方式Detailed ways

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

图1示出了本发明实施例提供的轨迹数据中的频繁子轨迹查找方法的实现流程，详述如下：Fig. 1 shows the implementation process of the frequent sub-trajectory search method in the trajectory data provided by the embodiment of the present invention, which is described in detail as follows:

在S101中，分离轨迹数据中的空间信息和时间信息。In S101, space information and time information in trajectory data are separated.

轨迹数据中包括了空间信息和时间信息，其中，空间信息一般包括所在位置的经度、纬度等，而时间信息通常通过unix时间戳进行表示。The trajectory data includes spatial information and time information, wherein the spatial information generally includes the longitude and latitude of the location, and the time information is usually represented by a unix timestamp.

表1为一段轨迹数据的具体示例，其中，记录的时间信息为进入对应的经度及纬度的unix时间戳：Table 1 is a specific example of a piece of track data, where the recorded time information is the unix timestamp entered into the corresponding longitude and latitude:

表1Table 1

在S101中，首先对轨迹数据中的空间信息和时间信息进行分离，将轨迹数据分离成一个空间信息序列，例如{(113.333,22.368),(113.111,23.013)…}，以及一个时间信息序列，例如{1385521584,1385521233…}。In S101, the spatial information and time information in the trajectory data are first separated, and the trajectory data is separated into a spatial information sequence, such as {(113.333, 22.368), (113.111, 23.013)...}, and a time information sequence, For example {1385521584,1385521233...}.

在S102中，将所述空间信息编码成第一类字符，每个所述第一类字符用于表示一个地理位置。In S102, the spatial information is encoded into characters of the first type, and each character of the first type is used to represent a geographic location.

在本实施例中，对S101分离出的空间信息进行聚类，转化成相应的地理位置，再编码成字符，且编码成的字符用于表示对应的地理位置。如图2所示，S102具体为：In this embodiment, the spatial information separated in S101 is clustered, transformed into a corresponding geographic location, and then encoded into characters, and the encoded characters are used to represent the corresponding geographic location. As shown in Figure 2, S102 is specifically:

在S201中，对所述空间信息进行聚类，生成N个簇，所述N为大于1的整数。In S201, cluster the spatial information to generate N clusters, where N is an integer greater than 1.

根据S101中分离出的轨迹数据的空间信息序列，基于经度和纬度将轨迹数据的空间信息进行聚类。具体地，可以采用基于密度的聚类算法（Density-BasedSpatial Clustering of Applications with Noise，DBSCAN）来实现空间信息的聚类。作为本发明的一个实现示例，如图3所示，其中的每一个点表示空间信息序列中的一个记录，基于这些点对应的经度数值和纬度数值对这些点进行聚类，最后生成了A、B、C三个簇，而剩余的孤立点则作为噪音被排除出去，不参与接下来的计算过程。According to the spatial information sequence of the trajectory data separated in S101, the spatial information of the trajectory data is clustered based on longitude and latitude. Specifically, a density-based clustering algorithm (Density-BasedSpatial Clustering of Applications with Noise, DBSCAN) can be used to implement clustering of spatial information. As an implementation example of the present invention, as shown in Figure 3, each point represents a record in the spatial information sequence, these points are clustered based on the longitude and latitude values corresponding to these points, and finally A, There are three clusters B and C, and the remaining isolated points are excluded as noise and do not participate in the next calculation process.

在S202中，分别确定生成的每个簇所对应的地理位置。In S202, the geographic location corresponding to each generated cluster is respectively determined.

在本实施例中，根据每个簇所涉及到的经度和纬度等位置信息，通过对比地图上的位置进行判别，以确定出每个簇所对应的地理位置。通常说来，在真实的数据采集过程中，生成的簇可以代表现实社会中的一个城市或者城市里的一个商圈，等等。In this embodiment, according to the location information such as longitude and latitude involved in each cluster, the location information corresponding to each cluster is determined by comparing the location on the map. Generally speaking, in the real data collection process, the generated clusters can represent a city in the real society or a business district in the city, and so on.

在S203中，根据每个簇所对应的地理位置进行字符编码，分别生成每个簇对应的所述第一类字符。In S203, character encoding is performed according to the geographic location corresponding to each cluster, and the first type of characters corresponding to each cluster are respectively generated.

经过图2所示步骤，轨迹数据中的空间信息例如{(113.333,22.368),(113.111,23.013)…}则会被转化成{A,C,…}，其中，A表示(113.333,22.368)所在的簇对应的字符，C表示(113.111,23.013)所在的簇对应的字符，由此实现了轨迹数据中的空间信息由原始二维数值至地理区域的转换。After the steps shown in Figure 2, the spatial information in the trajectory data such as {(113.333, 22.368), (113.111, 23.013)...} will be converted into {A,C,...}, where A represents (113.333,22.368) The character corresponding to the cluster where it is located, and C indicates the character corresponding to the cluster where (113.111, 23.013) is located, thus realizing the transformation of the spatial information in the trajectory data from the original two-dimensional value to the geographic area.

在S103中，将所述时间信息编码成第二类字符，每个所述第二类字符用于表示一段间隔时间。In S103, the time information is encoded into characters of a second type, and each character of the second type is used to represent an interval time.

如图4所示，S103具体为：As shown in Figure 4, S103 is specifically:

在S401中，将所述时间信息由时间戳转换成间隔时间。In S401, the time information is converted from a time stamp into an interval time.

在S102中，已将空间信息序列已经转换成字符序列格式，而原始的时间信息是一个时间戳序列，序列里的每一个时间戳均代表进入空间信息序列中相应地理位置的时间，则在S103中，需要将时间戳转换成间隔时间，以确定出从进入A到进入B之间的时间差。例如，转换后的空间信息序列为{A,A,B,…}，对应的间隔时间序列为{t₁,t₂,t₃,…}，则进入A的时间是t₁，进入B的时间是t₃，那么从进入A到进入B之间的时间间隔是t₃-t₁。In S102, the spatial information sequence has been converted into a character sequence format, and the original time information is a time stamp sequence, and each time stamp in the sequence represents the time of entering the corresponding geographic location in the spatial information sequence, then in S103 In , it is necessary to convert the timestamp into an interval time to determine the time difference between entering A and entering B. For example, the converted spatial information sequence is {A,A,B,…}, and the corresponding interval time sequence is {t₁ ,t₂ ,t₃ ,…}, then the time to enter A is t₁ , and the time to enter B is The time is t₃ , then the time interval from entering A to entering B is t₃ -t₁ .

在S402中，标准化所述间隔时间。In S402, the interval time is standardized.

在S402中，对转换得到的间隔时间进行标准化之前，首先先将其中过长的间隔时间作为噪音去除，再对剩余的间隔时间进行标准化。具体地，可以通过下式对间隔时间进行标准化：In S402 , before normalizing the converted interval times, the excessively long interval times are firstly removed as noise, and then the remaining interval times are normalized. Specifically, the interval time can be standardized by the following formula:

interval(k）=interval(k)/max_m＝１：ninterval(m)interval(k)=interval(k)/max_m=1:n interval(m)

其中，interval(k)表示第k个间隔时间，max_m＝1：ninterval(m)表示n个有效间隔时间里的最大值，具体的标准化方法，即是取所有有效间隔时间里的最大值，再将第k个间隔时间除以该最大值，即得到了标准化的第k个间隔时间。对于标准化的第k个间隔时间，作为本发明的一个实现示例，其结果可以精确到0.001。Among them, interval(k) represents the kth interval time, max_m=1:n interval(m) represents the maximum value in n effective interval times, and the specific standardization method is to take the maximum value in all effective interval times , and then divide the k-th interval time by the maximum value to get the standardized k-th interval time. As for the standardized k-th interval time, as an implementation example of the present invention, the result can be accurate to 0.001.

通过对间隔时间进行标准化之后，所有的间隔时间就会转换成类似0.303、0.349、0.788等数值。After normalizing the interval time, all interval times will be converted into values like 0.303, 0.349, 0.788, etc.

在S403中，为每个标准化后的所述间隔时间匹配第二类字符。In S403, match the second type of characters for each standardized interval time.

通常，可以将数值转化成字符最简单的办法是对其进行四舍五入，例如：Usually, the easiest way to convert a numeric value to a character is to round it, for example:

间隔时间1=0.349≈0.3，间隔时间2=0.350≈0.4，则根据预设的数值与字符的对应关系，将0.3转换成字符3，将0.4转换成字符4。然而，实际上，间隔时间1与间隔时间2的真实数值极为相近，但却被匹配成了不同的字符，使得匹配结果无法反映出不同数值之间的真实差距，因此，作为本发明的一个实施例，可以采用非精确匹配的办法来解决这个问题，通过非精确匹配，为每个标准化后的所述间隔时间匹配一个第二类字符的组合，所述第二类字符的组合中包含两个第二类字符。Interval time 1=0.349≈0.3,interval time 2=0.350≈0.4, then convert 0.3 intocharacter 3 and convert 0.4 into character 4 according to the preset correspondence between values and characters. However, in fact, the real values ofinterval time 1 andinterval time 2 are very similar, but they are matched into different characters, so that the matching result cannot reflect the real gap between different values. Therefore, as an implementation of the present invention For example, non-exact matching can be used to solve this problem. Through non-exact matching, a second-type character combination is matched for each standardized interval time, and the second-type character combination includes two The second type of characters.

具体地：可以采用如下的非精确匹配方法：Specifically: the following non-exact matching methods can be used:

间隔时间1=0.349≈0.3||0.35=字符6||字符7；Interval time 1=0.349≈0.3||0.35=Character 6||Character 7;

间隔时间2=0.350≈0.35||0.4=字符7||字符8。Interval time 2=0.350≈0.35||0.4=Character 7||Character 8.

即，首先确定标准化后的间隔时间所在的预设数值区间，并确定该预设数值区间的两个数值端点，然后，根据预设的第二类字符与数值端点的对应关系，将这两个数值所分别对应的两个第二类字符匹配给该标准化后的间隔时间，从而将每个标准化后的间隔时间转换成包含了两个第二类字符的组合(字符k,字符k+1)。That is, first determine the preset numerical interval where the standardized interval time is located, and determine the two numerical endpoints of the preset numerical interval, and then, according to the preset correspondence between the second type of characters and the numerical endpoints, the two The two second-type characters corresponding to the values are matched to the standardized interval time, so that each standardized interval time is converted into a combination containing two second-type characters (character k, character k+1) .

在经过了S102和S103之后，轨迹数据即可以被转换成空间信息和时间信息交叉的字符串序列，例如：After S102 and S103, the trajectory data can be converted into a character string sequence in which spatial information and time information intersect, for example:

A（字符6）B（字符6）C…A (Character 6) B (Character 6) C...

在S104中，根据编码成所述第一类字符的所述空间信息和编码成所述第二类字符的所述时间信息，建立广义后缀树。In S104, a generalized suffix tree is established according to the space information encoded into the first type of characters and the time information encoded into the second type of characters.

后缀树（suffix tree）作为一种数据结构，能够用来支持有效的字符串匹配和查询，在本实施例中，由于时间信息是通过包含了两个第二类字符的组合来编码表示的，因此，在建树的时候，可以用第二类字符的组合中代表较小数值的那个字符来表示。例如，间隔时间1=字符6||字符7，则在建树过程中，可以用字符6来表示该间隔时间1。As a data structure, suffix tree (suffix tree) can be used to support effective character string matching and query. In this embodiment, since the time information is encoded by a combination of two second-type characters, Therefore, when building a tree, it can be represented by the character representing the smaller value in the combination of the second type of characters. For example,interval time 1=character 6||character 7, then character 6 can be used to represent theinterval time 1 during the tree building process.

在将后缀树节点上的时间信息和还未放到后缀树上的时间信息进行比较的时候，非精确匹配的具体场景如下：When comparing the time information on the suffix tree node with the time information that has not been placed on the suffix tree, the specific scenarios for inexact matching are as follows:

节点n=字符k=字符k||字符(k+1)，node n = character k = character k || character(k+1),

例如，节点n=字符6=字符6||字符7，即，需要比较的间隔时间被编码为字符6和字符7时，均能够匹配上字符6所对应的节点n。For example, node n=character 6=character 6||character 7, that is, when the time interval to be compared is coded as character 6 and character 7, the node n corresponding to character 6 can be matched.

而对于空间信息，仍采用精确匹配的方式来放置到后缀树上。As for the spatial information, it is still placed on the suffix tree by exact matching.

在本实施例中，可以采用Ukkonen算法来完成广义后缀树的建树过程，通过上述方法建立好的广义后缀树如图5所示，其中，每一个非根节点均表示一个子字符串，且通过为后缀树中的每个节点新增一个count属性，用于对该节点对应的字符串在广义后缀树中出现的次数进行计数，那么，这个子字符串出现的次数就是这个节点的所有子节点里叶节点的count属性之和：In this embodiment, the Ukkonen algorithm can be used to complete the generalized suffix tree building process. The generalized suffix tree established by the above method is shown in Figure 5, wherein each non-root node represents a substring, and is passed Add a count attribute to each node in the suffix tree to count the number of occurrences of the string corresponding to the node in the generalized suffix tree. Then, the number of occurrences of this substring is all child nodes of this node The sum of the count attributes of the leaf nodes:

Count(s)=count_leaf1+count_leaf2+count_leaf3+…，Count(s)=count_leaf1+count_leaf2+count_leaf3+...,

其中，s表示一个字符串，而Count(s)则表示由根节点出发路径为s所到达的节点的count属性值，count_leaf1，count_leaf2，count_leaf3，…则分别代表该节点的子节点里所有叶节点的count属性值。Among them, s represents a string, and Count(s) represents the count attribute value of the node reached by the path s starting from the root node, count_leaf1, count_leaf2, count_leaf3, ... respectively represent all leaf nodes in the child nodes of the node The value of the count attribute.

而反过来，每个叶节点的count属性则表示该叶节点所含二维索引的多少，设in代表一个叶节点的二维索引，则：Conversely, the count attribute of each leaf node indicates the number of two-dimensional indexes contained in the leaf node. If in represents the two-dimensional index of a leaf node, then:

in=(index1,index2)，in=(index1,index2),

其中，index1表示该叶节点对应的子字符串在哪一个字符串出现过，index2表示该叶节点对应的子字符串在该字符串出现的起始位置。Wherein, index1 indicates in which string the substring corresponding to the leaf node appears, and index2 indicates the starting position where the substring corresponding to the leaf node appears in the string.

如图5所示，(0,5)、(1,2)是其中位于最左边的叶节点的两个二维索引，该叶节点代表的字符串是后缀“A”，则(0,5)表示的是后缀“A”在第0个字符串“BANANA”中出现且出现的起始位置为5（注:按照计算机科学的习惯，此处索引号码及位置的计数都从0开始），(1,2)则代表后缀”A”在第1个字符串”ANA”里出现且出现的起始位置为2。As shown in Figure 5, (0,5) and (1,2) are two two-dimensional indexes of the leftmost leaf node, and the string represented by the leaf node is the suffix "A", then (0,5 ) indicates that the suffix "A" appears in the 0th character string "BANANA" and the starting position is 5 (Note: According to the habit of computer science, the counting of the index number and position here starts from 0), (1,2) means that the suffix "A" appears in the first character string "ANA" and its starting position is 2.

在S105中，查找所述广义后缀树中的频繁子字符串。In S105, frequent substrings in the generalized suffix tree are searched.

对于S104中建立好的广义后缀树，采取广度优先遍历的办法来进行树的遍历，如果一个节点的count属性的数值满足如下条件：For the generalized suffix tree established in S104, the breadth-first traversal method is used to traverse the tree. If the value of the count attribute of a node satisfies the following conditions:

Count(节点A)>min_repeat_times，其中，Count(节点A)表示广义后缀树中节点A的count属性值，min_repeat_times用于表示一预设阈值，Count(node A)>min_repeat_times, wherein, Count(node A) represents the count attribute value of node A in the generalized suffix tree, and min_repeat_times is used to represent a preset threshold,

则该节点代表的子字符串是频繁子字符串，即，将广义后缀树中的count属性大于预设阈值的节点所对应的字符串确定为所述频繁子字符串Then the substring represented by this node is a frequent substring, that is, the character string corresponding to the node whose count attribute in the generalized suffix tree is greater than the preset threshold is determined as the frequent substring

反之，若一个节点被判断为不是频繁子字符串，则对以节点为根的子树进行剪树，后续过程中不再遍历该节点的子节点，以此来提高频繁子字符串的查找效率。Conversely, if a node is judged not to be a frequent substring, the subtree rooted at the node will be pruned, and the child nodes of the node will not be traversed in the subsequent process, so as to improve the search efficiency of frequent substrings .

在S106中，将查找出的所述频繁子字符串转换成频繁子轨迹。In S106, the found frequent substrings are converted into frequent sub-trajectories.

对于S105中所查找出的频繁子字符串，如A(字符6)B(字符7)C…，可以逐个字符地将其转化为该字符代表的空间信息或者时间信息，具体地：For the frequent substring found in S105, such as A (character 6) B (character 7) C..., it can be converted character by character into the space information or time information represented by the character, specifically:

若该字符为第一类字符，则表示该字符代表的是空间信息，因此将该字符转化成该字符对应的地理位置；If the character is the first type of character, it means that the character represents spatial information, so the character is converted into the geographical location corresponding to the character;

若该字符为第二类字符，则表示该字符代表的是时间信息，因此取该字符以及该字符的邻居字符，转化成对应的数值并取均值。例如，当用第二类字符的组合中代表较小数值的那个字符来表示时间信息时，若字符为6，则其邻居字符为7，上述两个字符分别对应的数值为0.3和0.35，那么则对0.3和0.35取均值，从而还原出该时间信息。If the character is the second type of character, it means that the character represents time information, so take the character and the character's neighbor characters, convert them into corresponding values and take the average. For example, when the time information is represented by the character representing the smaller numerical value in the combination of the second type of characters, if the character is 6, its neighbor character is 7, and the corresponding values of the above two characters are 0.3 and 0.35 respectively, then Then take the average value of 0.3 and 0.35, so as to restore the time information.

表2示出了根据图5展示的广义后缀树最后生成的频繁子轨迹的示例：Table 2 shows an example of the final frequent subtrajectories generated according to the generalized suffix tree shown in Figure 5:

表2Table 2

本发明实施例结合了数据挖掘技术、后缀树算法以及非精确匹配，从而实现了较优的轨迹数据中的频繁子轨迹的查找，通过使用较为高效的字符串算法来处理较为复杂的多维数值数据，使得整个频繁子轨迹查找过程的计算复杂度大大降低，且合理的聚类方法也使得本发明实施例对轨迹数据空间信息的聚类划分更加准确。The embodiment of the present invention combines data mining technology, suffix tree algorithm and non-exact matching, thereby realizing the search of frequent sub-trajectories in better trajectory data, and processing more complex multi-dimensional numerical data by using a more efficient string algorithm , so that the calculation complexity of the entire frequent sub-trajectory search process is greatly reduced, and a reasonable clustering method also makes the clustering division of the spatial information of the trajectory data more accurate in the embodiment of the present invention.

图6示出了本发明实施例提供的轨迹数据中的频繁子轨迹查找装置的结构框图，该装置可以用于运行本发明图1至图5实施例所述的轨迹数据中的频繁子轨迹查找方法。为了便于说明，仅示出了与本实施例相关的部分。Fig. 6 shows a structural block diagram of a frequent sub-trajectory search device in the track data provided by an embodiment of the present invention, which can be used to run the frequent sub-track search in the track data described in the embodiments of Fig. 1 to Fig. 5 of the present invention method. For ease of description, only the parts related to this embodiment are shown.

参照图6，该装置包括；Referring to Figure 6, the device includes;

分离单元61，分离轨迹数据中的空间信息和时间信息。Aseparation unit 61, for separating spatial information and time information in the track data.

第一编码单元62，将所述空间信息编码成第一类字符，每个所述第一类字符用于表示一个地理位置。Thefirst encoding unit 62 is configured to encode the spatial information into characters of the first type, and each character of the first type is used to represent a geographic location.

第二编码单元63，将所述时间信息编码成第二类字符，每个所述第二类字符用于表示一段间隔时间。Thesecond encoding unit 63 is configured to encode the time information into characters of a second type, and each character of the second type is used to represent an interval of time.

建立单元64，根据编码成所述第一类字符的所述空间信息和编码成所述第二类字符的所述时间信息，建立广义后缀树。Thebuilding unit 64 is configured to build a generalized suffix tree according to the space information encoded into the first type of characters and the time information encoded into the second type of characters.

查找单元65，查找所述广义后缀树中的频繁子字符串。The searchingunit 65 is configured to search for frequent substrings in the generalized suffix tree.

转换单元66，将查找出的所述频繁子字符串转换成频繁子轨迹。Theconversion unit 66 is configured to convert the found frequent substrings into frequent sub-trajectories.

可选地，所述第一编码单元62包括：Optionally, thefirst encoding unit 62 includes:

聚类子单元，对所述空间信息进行聚类，生成N个簇，所述N为大于1的整数。The clustering subunit clusters the spatial information to generate N clusters, where N is an integer greater than 1.

确定子单元，分别确定生成的每个簇所对应的地理位置。Determine the sub-units, and respectively determine the geographic location corresponding to each cluster generated.

编码子单元，根据为生成的每个簇所对应的地理位置进行字符编码，分别生成每个簇对应的所述第一类字符。The encoding subunit performs character encoding according to the geographical location corresponding to each generated cluster, and respectively generates the first type of characters corresponding to each cluster.

可选地，所述第二编码单元63包括：Optionally, thesecond coding unit 63 includes:

转换子单元，将所述时间信息由时间戳转换成间隔时间。The conversion subunit converts the time information from a time stamp into an interval time.

标准化子单元，标准化所述间隔时间。A normalization subunit normalizes the interval time.

匹配子单元，为每个标准化后的所述间隔时间匹配第二类字符。The matching subunit matches the second type of characters for each standardized interval time.

可选地，所述匹配子单元具体用于：Optionally, the matching subunit is specifically used for:

确定所述标准化后的所述间隔时间所在的预设数值区间的两个数值端点；Determining the two numerical endpoints of the preset numerical interval where the standardized interval time is located;

将所述两个数值端点分别对应的两个第二类字符匹配给该标准化后的所述间隔时间。Matching the two second-type characters respectively corresponding to the two numerical endpoints to the standardized interval time.

可选地，所述装置还包括：Optionally, the device also includes:

增加单元，为所述广义后缀树中的每个节点增加一个计数属性，所述计数属性用于对该节点对应的字符串在所述广义后缀树中出现的次数进行计数；An adding unit, adding a counting attribute to each node in the generalized suffix tree, the counting attribute is used to count the number of times the character string corresponding to the node appears in the generalized suffix tree;

所述查找单元65具体用于：Thesearch unit 65 is specifically used for:

将所述广义后缀树中的所述计数属性大于预设阈值的节点所对应的字符串确定为所述频繁子字符串。Determining a character string corresponding to a node in the generalized suffix tree whose count attribute is greater than a preset threshold as the frequent substring.

以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等，均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention should be included in the protection of the present invention. within range.