CN102523241B

Movatterモバイル変換

Info

Publication number: CN102523241B
Application number: CN201210006268.7A
Authority: CN
Inventors: 顾仁涛; 许艳红; 纪越峰
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2012-01-09
Filing date: 2012-01-09
Publication date: 2014-11-19
Anticipated expiration: 2032-01-09
Also published as: CN102523241A

Abstract

The invention relates to a method and a device for classifying network traffic on line based on decision tree high-speed parallel processing. The method comprises the following steps of: performing acquisition, distribution and manual classification on early real traffic data, extracting the packet characteristics of an early transmission control protocol (TCP) stream set, establishing a decision tree classification model, converting a data structure, performing distribution and class judgment on a data packet to be classified, tagging a current data packet, extracting the packet characteristics of a TCP stream to be classified, and searching for a decision tree. The device comprises a decision tree construction module, a structure conversion module, a classification result processing module, a medium access control (MAC) layer processing module, a data packet polling management module, a distribution judgment module, a traffic information extraction and tagging module, and a decision tree searching module. The method and the device are low in algorithm complexity and high in processing speed, classification accuracy and stability, and can be used for equipment and systems with requirements for online traffic classification in a high speed backbone network; and online classification can be realized.

Description

Translated fromChinese

基于决策树高速并行处理的网络流量在线分类方法及装置Method and device for online classification of network traffic based on high-speed parallel processing of decision trees

技术领域technical field

本发明涉及一种网络流量在线分类的方法及装置，尤其涉及一种基于决策树高速并行处理策略实现TCP流量在线分类的方法及装置，属于通信技术领域。The invention relates to a method and device for online classification of network traffic, in particular to a method and device for realizing online classification of TCP traffic based on a high-speed parallel processing strategy of a decision tree, and belongs to the field of communication technology.

背景技术Background technique

如今，网络技术的发展越来越迅速，基于网络的应用越来越多、越来越复杂。各种各样的应用不但抢占着越来越多的网络资源，而且也对QoS和网络安全带来了巨大的威胁。在这样的背景下，如何给广大的互联网用户提供一个安全、可靠、高效的使用环境，如何发现并避免网络的异常流量，是网络管理领域需要解决的重要问题。为了解决上述这些问题，网络研究人员提出了流量调度、容量规划等一系列策略来提高网络的运营效率。然而，无论是对现有网络进行扩容改造，还是进行QoS调度，都必须对网络流量中的各种应用(如P2P、Web、IM、视频流量等)进行准确的分类与识别。此外，在网络安全、流量计费、应用趋势分析等研究领域，准确的流量分类也具有极其重要的意义。有线宽带和3G/4G的迅速推广，使得流量分类这一有效进行网络精细化管理的工具更有广阔的应用前景。Nowadays, network technology is developing more and more rapidly, and more and more network-based applications are becoming more and more complex. Various applications not only occupy more and more network resources, but also pose a huge threat to QoS and network security. Under such a background, how to provide a safe, reliable and efficient environment for the majority of Internet users, and how to discover and avoid abnormal network traffic are important issues to be solved in the field of network management. In order to solve the above problems, network researchers have proposed a series of strategies such as traffic scheduling and capacity planning to improve network operation efficiency. However, it is necessary to accurately classify and identify various applications in network traffic (such as P2P, Web, IM, video traffic, etc.) whether it is expanding the existing network or performing QoS scheduling. In addition, accurate traffic classification is also of great significance in research fields such as network security, traffic billing, and application trend analysis. The rapid promotion of wired broadband and 3G/4G makes traffic classification, a tool for effective fine-grained network management, more broad application prospects.

传统的流量分类技术主要基于传输层的端口信息，然而近年来，在互联网网络带宽不断提高以及应用层协议逐渐复杂多样的趋势下，许多网络应用与端口的相关性越来越小，伪装端口以及动态端口等情况使得上述方法已经很难适应技术和应用的发展与需求，这就迫切需要引入新的理论和技术，深层次挖掘网络应用的内在特征。为了适应Internet流量数据庞大、应用属性动态变化的特点，利用机器学习方法处理流量分类问题成为当前网络测量领域内一个新兴的研究热点。例如：朴素贝叶斯算法、改进贝叶斯算法、决策树算法、KNN算法、支持向量机算法、神经网络算法以及各种聚类算法等等。基于机器学习的流量分类技术不依赖于传输层端口号或解析有效负载来识别网络应用，而是利用流量在传输过程中表现出来的流的各种统计特征如包长、包间隔时间等来识别网络应用，方法本身不受伪装端口、动态端口、有效负载加密甚至网络地址转换的影响，在分类性能和灵活性方面，较之前述各种方法都有所突破。The traditional traffic classification technology is mainly based on the port information of the transport layer. However, in recent years, with the continuous improvement of Internet network bandwidth and the trend of increasingly complex and diverse application layer protocols, many network applications have less and less correlation with ports. Masquerading ports and Dynamic ports and other situations make it difficult for the above methods to adapt to the development and needs of technology and applications, which urgently requires the introduction of new theories and technologies to deeply explore the inherent characteristics of network applications. In order to adapt to the characteristics of huge Internet traffic data and dynamic changes of application attributes, using machine learning methods to deal with traffic classification problems has become an emerging research hotspot in the field of network measurement. For example: naive Bayesian algorithm, improved Bayesian algorithm, decision tree algorithm, KNN algorithm, support vector machine algorithm, neural network algorithm and various clustering algorithms, etc. Machine learning-based traffic classification technology does not rely on transport layer port numbers or parsing payloads to identify network applications, but uses various statistical characteristics of the flow shown during traffic transmission, such as packet length, packet interval time, etc. For network applications, the method itself is not affected by masquerading ports, dynamic ports, payload encryption or even network address translation, and has breakthroughs in classification performance and flexibility compared with the above-mentioned methods.

然而，目前业界对流量分类技术的研究还远远无法满足业务发展的需求，主要体现在目前大多数技术都采用离线分类的手段，无法实现实时在线的分类。这就限制了流量分类技术在高速骨干网中的应用。However, the current research on traffic classification technology in the industry is still far from meeting the needs of business development. This is mainly reflected in the fact that most current technologies use offline classification methods, which cannot realize real-time online classification. This limits the application of traffic classification technology in high-speed backbone networks.

为了满足目前和未来高速骨干网的需要，流量分类技术迫切需要满足以下几点要求：1)分类准确性较高，避免采用端口或者净荷作为主要识别特征；2)算法复杂度较低，具体实现设计上要有并行化处理的特性，易于硬件实现(如FPGA、CPLD、ASIC等)，保证网络流量的高速在线分类；3)分类稳定性较好，能够适用于复杂多变的网络环境。In order to meet the needs of current and future high-speed backbone networks, the traffic classification technology urgently needs to meet the following requirements: 1) The classification accuracy is high, and ports or payloads are avoided as the main identification features; 2) The algorithm complexity is low, and the specific The implementation design must have the characteristics of parallel processing, easy to implement in hardware (such as FPGA, CPLD, ASIC, etc.), and ensure high-speed online classification of network traffic; 3) The classification stability is good, and it can be applied to complex and changeable network environments.

发明内容Contents of the invention

本发明提供了一种基于决策树高速并行处理策略实现TCP流量在线分类的方法及装置，能够实现网络流量的高速实时在线分类，稳定性好，准确性高。The invention provides a method and device for realizing TCP flow online classification based on a decision tree high-speed parallel processing strategy, which can realize high-speed real-time online classification of network flow, and has good stability and high accuracy.

为实现上述的发明目的，本发明采用下述的技术方案：For realizing above-mentioned purpose of the invention, the present invention adopts following technical scheme:

一种基于决策树高速并行处理策略实现TCP流量在线分类的方法，其特征在于包括以下步骤：A method for realizing TCP traffic online classification based on decision tree high-speed parallel processing strategy, is characterized in that comprising the following steps:

步骤1，前期真实流量数据的采集、分流及手工分类：采集网络真实流量数据集，利用五元组将数据集分离为不同的TCP流，对TCP流的集合进行手工分类，使每一条TCP流都与一种协议类型相对应。Step 1, the collection, distribution and manual classification of real traffic data in the early stage: collect the real traffic data set of the network, use quintuple to separate the data set into different TCP streams, and manually classify the collection of TCP streams, so that each TCP stream Both correspond to a protocol type.

步骤2，提取前期TCP流集合的若干个包特征：提取每一条TCP流中关于数据包的特征，并按照数据包在该TCP流的先后顺序构建初步特征序列，然后根据特征选择算法对初步提取的包特征进行处理，筛选出最能体现流类别特性的包特征并形成最终特征序列。Step 2, extract several packet features of the previous TCP flow set: extract the characteristics of the data packets in each TCP flow, and construct a preliminary feature sequence according to the order of the data packets in the TCP flow, and then use the feature selection algorithm to initially extract The packet features are processed, and the packet features that best reflect the characteristics of the flow category are screened out to form the final feature sequence.

步骤3，决策树分类模型的建立：对步骤2构成的最终特征序列，利用决策树算法进行建树。Step 3, establishment of a decision tree classification model: use the decision tree algorithm to build a tree for the final feature sequence formed in step 2.

步骤4，对步骤3中建立的决策树进行数据结构转换并存储到硬件设备(如FPGA、CPLD、ASIC等)的存储设备(如RAM、ROM、FLASH等)中：通过对决策树的遍历，一方面提取决策树的中间节点值，对同一属性的各中间节点值进行从小到大的排序，然后对所有属性的各个中间节点值按顺序进行从小到大的编码，另一方面提取决策树的边缘节点值，对边缘节点值同样也进行编码，边缘节点值的编码是一个范围，取决于到达该边缘节点所经历的各中间节点的编码值。中间节点值及其编码以及边缘节点值及其编码分别存储在两块分离的存储设备(如RAM、ROM、FLASH等)中。其中在流量分类装置用于网络流量在线分类之前，以离线处理的方式建立决策树并对决策树进行数据结构转换。Step 4, convert the data structure of the decision tree established in step 3 and store it in the storage device (such as RAM, ROM, FLASH, etc.) of the hardware device (such as FPGA, CPLD, ASIC, etc.): by traversing the decision tree, On the one hand, extract the intermediate node values of the decision tree, sort the intermediate node values of the same attribute from small to large, and then encode the intermediate node values of all attributes in order from small to large, and on the other hand extract the decision tree. The edge node value also encodes the edge node value, and the encoding of the edge node value is a range, which depends on the encoding value of each intermediate node that reaches the edge node. Intermediate node values and their codes and edge node values and their codes are stored in two separate storage devices (such as RAM, ROM, FLASH, etc.). Wherein, before the traffic classification device is used for online classification of network traffic, a decision tree is established in an off-line processing manner and data structure conversion is performed on the decision tree.

步骤5，对待分类的数据包进行分流及类别判断：根据五元组将数据包划分到不同的流并查找流信息表获取分类信息，流信息表用于记录流的五元组信息以及该条流的类别。流信息表仅需要保存已分类的流的记录，不需要保存未分类的流的记录，因此对流信息表进行查找时，若不存在记录则可以立即判断为未分类，从而节省查找时间。Step 5, divide the data packets to be classified and judge the category: divide the data packets into different flows according to the five-tuple and search the flow information table to obtain the classification information. The flow information table is used to record the five-tuple information of the flow and the The class of the stream. The flow information table only needs to save the records of classified flows, not the records of unclassified flows. Therefore, when searching the flow information table, if there is no record, it can be judged as unclassified immediately, thus saving search time.

步骤6，对当前数据包进行打标签处理并提取待分类TCP流的包特征：利用步骤5提取的类别信息对所有经过的数据包进行打标签处理，若数据包所属的流已经被分类，则打上相应的类别标签，若未分类，则按照一定的原则标记一个默认的标签，然后判断该数据包是否需要被提取包特征并做相应处理。在这里，包特征的提取与步骤2中采用的最终特征序列相对应，需要按包到达顺序进行提取，并构建待分类流的特征序列，包的顺序信息按照到达观测点的时间顺序进行排列，取三次握手的第一个请求包即Setup包作为该流的第一个包。待分类流的特征序列存储在参数表中，参数表的一条记录包括五元组、各个包特征值以及参数是否满的标志。已分类被打上正确标签的包与未分类打上默认标签且需要进行参数提取的包需要进行时钟同步处理，即插入相应级别的流水线以保证数据包传输路线上的FIFO不会出现溢出现象。Step 6: Tag the current data packet and extract the packet characteristics of the TCP flow to be classified: use the category information extracted in step 5 to tag all passing data packets, if the flow to which the data packet belongs has been classified, then Label the corresponding category. If it is not classified, mark a default label according to certain principles, and then judge whether the data packet needs to be extracted and processed accordingly. Here, the extraction of packet features corresponds to the final feature sequence used in step 2. It needs to be extracted according to the order of packet arrival, and the feature sequence of the flow to be classified is constructed. Take the first request packet of the three-way handshake, that is, the Setup packet as the first packet of the flow. The characteristic sequence of the flow to be classified is stored in the parameter table, and a record of the parameter table includes five tuples, characteristic values of each packet, and a flag indicating whether the parameter is full. Packets that have been classified and tagged correctly and unclassified packets that are tagged with a default tag and need to be parameterized need to be clock-synchronized, that is, a corresponding level of pipeline is inserted to ensure that the FIFO on the data packet transmission route does not overflow.

步骤7，决策树查找：利用步骤6所得的待分类流的特征序列对步骤4所得的两块存储设备(如RAM、ROM、FLASH等)进行查找，判断该TCP流的类别值并更新流信息表。在查找过程中采用并行处理策略，仅需要两个时钟周期即可完成决策树的查找过程。即第一个时钟周期并行比较所有属性的所有中间节点值，确定该流所属的所有中间节点编码值并合并为一个数据，第二个时钟周期利用前一个时钟周期的结果数据并行比较所有边缘节点的编码值，从而确定该流的类别。Step 7, decision tree search: use the characteristic sequence of the flow to be classified obtained in step 6 to search the two storage devices (such as RAM, ROM, FLASH, etc.) obtained in step 4, judge the category value of the TCP flow and update the flow information surface. The parallel processing strategy is adopted in the search process, and only two clock cycles are needed to complete the search process of the decision tree. That is, the first clock cycle compares all intermediate node values of all attributes in parallel, determines the encoding values of all intermediate nodes to which the flow belongs and merges them into one data, and uses the result data of the previous clock cycle to compare all edge nodes in parallel in the second clock cycle The encoded value of the stream to determine the class of the stream.

一种基于硬件设备(如FPGA、CPLD、ASIC等)的流量在线分类装置，用于实现上述的利用决策树高速并行处理策略实现TCP流量在线分类的装置包括在线部分和离线部分，其中离线部分具有决策树建树模块、决策树结构转换模块、分类结果处理模块；在线部分包括顺序连接的MAC层处理模块一、数据包轮询管理模块、分流判断模块、流量信息提取及打标签模块、决策树查找模块、MAC层处理模块二。其中，离线部分的决策树结构转换模块与在线部分的决策树查找模块相连接，离线部分的分类结果处理模块与在线部分的决策树查找模块通过一个流信息表间接连接。在线部分采用流水线处理技术。A kind of traffic online classification device based on hardware equipment (such as FPGA, CPLD, ASIC etc.), is used to realize above-mentioned utilizing decision tree high-speed parallel processing strategy to realize the device of TCP traffic online classification including online part and offline part, wherein offline part has Decision tree building module, decision tree structure conversion module, classification result processing module; the online part includes sequentially connected MAC layer processing module 1, data packet polling management module, distribution judgment module, traffic information extraction and labeling module, decision tree search Module, MAC layer processing module two. Among them, the decision tree structure conversion module of the offline part is connected with the decision tree search module of the online part, and the classification result processing module of the offline part is indirectly connected with the decision tree search module of the online part through a flow information table. The online part adopts pipeline processing technology.

决策树建树模块，用于根据前期真实网络数据流量建立决策树模型。The decision tree building module is used to build a decision tree model based on the real network data traffic in the early stage.

决策树结构转换模块，用于将决策树模型的结构进行转换，使之成为易于硬件实现的另一种数据结构并存储到硬件设备(如FPGA、CPLD、ASIC等)的存储设备(如RAM、ROM、FLASH等)中：通过对决策树的遍历，一方面提取决策树的中间节点值，对同一属性的各中间节点值进行从小到大的排序，然后对所有属性的各个中间节点值按顺序进行从小到大的编码，另一方面提取决策树的边缘节点值，对边缘节点值同样也进行编码，边缘节点值的编码是一个范围，取决于到达该边缘节点所经历的各中间节点的编码值。中间节点值及其编码以及边缘节点值及其编码分别存储在两块分离的存储设备(如RAM、ROM、FLASH等)中。The decision tree structure conversion module is used to convert the structure of the decision tree model so that it becomes another data structure that is easy to be implemented by hardware and is stored in a storage device (such as RAM, ROM, FLASH, etc.): By traversing the decision tree, on the one hand, extract the intermediate node values of the decision tree, sort the intermediate node values of the same attribute from small to large, and then sort the intermediate node values of all attributes in order Carry out coding from small to large, on the other hand, extract the edge node value of the decision tree, and also encode the edge node value. The coding of the edge node value is a range, which depends on the coding of each intermediate node experienced by reaching the edge node value. Intermediate node values and their codes and edge node values and their codes are stored in two separate storage devices (such as RAM, ROM, FLASH, etc.).

MAC层处理模块如今有许多现成的IP核可以采用。There are many off-the-shelf IP cores that can be used in the MAC layer processing module.

数据包轮询管理模块，用于从N个数据包缓冲队列中读取数据包，此处采用轮询式访问每个输入队列，直到从一个队列读完一个完整的数据包才转到下一个队列。The data packet polling management module is used to read data packets from N data packet buffer queues. Here, polling is used to access each input queue until a complete data packet is read from a queue before going to the next one. queue.

分流判断模块，用于根据五元组将数据包划分到不同的流，判断该条流是否被分类，如果被分类，类别值是多少，并维护流信息表。流信息表用于记录流的五元组信息以及该条流的类别。流信息表仅由决策树查找模块进行更新处理，其他模块均不能对流信息表进行写操作。The distribution judging module is used to divide the data packets into different flows according to the quintuple, judge whether the flow is classified, and if so, what is the classification value, and maintain the flow information table. The flow information table is used to record the five-tuple information of the flow and the category of the flow. The flow information table is only updated by the decision tree search module, and other modules cannot write to the flow information table.

流量信息提取及打标签模块，用于对未分类的数据包进行流特征提取并对所有数据包进行打标签处理。该模块需要维护一张参数表，参数表记录各条流的五元组信息、特征值以及特征值是否满的标志。特征值满的流的参数信息被发送到决策树查找模块。The flow information extraction and labeling module is used to extract flow features from unclassified data packets and label all data packets. This module needs to maintain a parameter table, and the parameter table records the quintuple information, eigenvalues and signs indicating whether the eigenvalues are full or not for each stream. The parameter information of the flow with full feature value is sent to the decision tree lookup module.

决策树查找模块，用于利用流量信息提取及打标签模块发送的待分类流的特征序列对决策树结构转换模块所得的两块存储设备(如RAM、ROM、FLASH等)进行查找，判断该TCP流的类别值并更新流信息表。在查找过程中采用并行处理策略，仅需要两个时钟周期即可完成决策树的查找过程。即第一个时钟周期并行查找所有属性的所有中间节点值，确定该流所属的所有中间节点编码值并合并为一个数据，第二个时钟周期利用前一个时钟周期的结果数据并行查找所有边缘节点的编码值，从而确定该流的类别，并将分类结果发送至流信息表以对流信息表中的记录进行更新。The decision tree search module is used to search the two storage devices (such as RAM, ROM, FLASH, etc.) obtained by the decision tree structure conversion module by using the feature sequence of the stream to be classified sent by the flow information extraction and labeling module, and judge the TCP The category value of the flow and update the flow information table. The parallel processing strategy is adopted in the search process, and only two clock cycles are needed to complete the search process of the decision tree. That is, in the first clock cycle, all intermediate node values of all attributes are searched in parallel, and all intermediate node encoding values to which the flow belongs are determined and combined into one data; in the second clock cycle, all edge nodes are searched in parallel using the result data of the previous clock cycle to determine the category of the flow, and send the classification result to the flow information table to update the records in the flow information table.

分类结果处理模块，用于对流量分类的结果进行汇总、处理及界面显示。The classification result processing module is used to summarize, process and display the traffic classification results.

因此，本发明提供的流量分类方法和装置，具有以下优点：对决策树的结构进行了转换，使之转换成一种易于硬件实现的数据结构，降低了算法本身的复杂度；决策树查找过程中使用了并行查找和流水线技术，提高了处理速度；选取的包特征提取过程简单，易于在线完成；利用了决策树本身所具有的准确度高、稳定性好的特点。总之，较低的算法复杂度、高效的硬件在线实现方式、准确稳定的流量分类结果构成了本发明的最大特色。Therefore, the traffic classification method and device provided by the present invention have the following advantages: the structure of the decision tree is converted into a data structure that is easy to implement in hardware, which reduces the complexity of the algorithm itself; The parallel search and pipeline technology are used to improve the processing speed; the feature extraction process of the selected package is simple and easy to complete online; the high accuracy and good stability of the decision tree itself are used. In a word, relatively low algorithm complexity, efficient hardware online implementation, and accurate and stable flow classification results constitute the greatest features of the present invention.

附图说明Description of drawings

为了更清楚地说明本发明，下面将对本发明实施例描述中所需要使用的附图作简单的介绍，显然地，下面描述中的附图仅仅是本发明的流量分类方法流程图、流量分类装置的结构示意图，对于本领域普通技术人员来讲，在不付出创造性劳动前提下，还可以根据这些附图获得的更多的附图。In order to illustrate the present invention more clearly, the accompanying drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the accompanying drawings in the following description are only the flow chart of the flow classification method and the flow classification device of the present invention. The structure schematic diagram of , for those of ordinary skill in the art, more drawings can also be obtained based on these drawings without creative work.

图1是本发明一个实施例提供的流量分类方法流程图；FIG. 1 is a flowchart of a flow classification method provided by an embodiment of the present invention;

图2是本发明一个实施例提供的流量分类装置的结构示意图；Fig. 2 is a schematic structural diagram of a flow classification device provided by an embodiment of the present invention;

具体实施方式Detailed ways

下面将结合本发明实施例的附图，对本发明实施例中的技术方案和装置进行清楚、完整地描述。此具体实施方式以决策树中C4.5算法为例进行详细描述，但是此方法同样适适用于其它决策树算法。此实施例基于FPGA实现，采用的存储设备是RAM，同样适用于其他硬件设备(如FPGA、CPLD、ASIC等)及存储设备(如RAM、ROM、FLASH等)。显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明的保护范围。The following will clearly and completely describe the technical solutions and devices in the embodiments of the present invention with reference to the accompanying drawings of the embodiments of the present invention. This specific implementation is described in detail by taking the C4.5 algorithm in the decision tree as an example, but this method is also applicable to other decision tree algorithms. This embodiment is realized based on FPGA, and the storage device adopted is RAM, which is also applicable to other hardware devices (such as FPGA, CPLD, ASIC, etc.) and storage devices (such as RAM, ROM, FLASH, etc.). Apparently, the described embodiments are only some of the embodiments of the present invention, but not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

图1为本发明一个实施例提供的流量分类方法的流程图，该方法包括：Fig. 1 is a flowchart of a flow classification method provided by an embodiment of the present invention, the method comprising:

S101：在不同时间从不同地点收集多个网络流量数据集，并对数据集进行分流和手工分类。S101: Collect multiple network traffic data sets from different locations at different times, and perform splitting and manual classification on the data sets.

网络流量分类装置一般部署在一定的网络中。决策树算法需要利用网络中的真实流量进行训练，以构建决策树分类模型，因此需要在准备部署的网络中设置网络探针，以便从网络中采集真实流量。上述的真实流量数据集包括用于手工分类确定流量协议类型所需的信息，以及数据包长、包间隔时间、包方向等后续步骤所需的特征参数；从传输层协议上看，数据流中至少包含TCP流和UDP流。需要把流量数据集按照{源地址、目的地址、源端口、目的端口、传输层协议类型}五元组将所得流量数据集分离为不同的TCP流，这样流量数据集就变为了TCP流的集合；其中，TCP流的头部的判断依据可以使用但不限于TCP流中的Setup、Setup/ACK、ACK数据包；并且一个数据流中数据包必须按照达到观测点的先后顺序排列。采用载荷分析等方法，以离线方式获得TCP数据流的协议类型，如WWW、MAIL、FTP、P2P等。The network traffic classification device is generally deployed in a certain network. The decision tree algorithm needs to use real traffic in the network for training to build a decision tree classification model. Therefore, it is necessary to set up network probes in the network to be deployed to collect real traffic from the network. The above-mentioned real traffic data set includes the information needed to manually classify and determine the type of traffic protocol, as well as the characteristic parameters required for subsequent steps such as data packet length, packet interval time, and packet direction; from the perspective of the transport layer protocol, in the data flow Contains at least TCP stream and UDP stream. The traffic data set needs to be separated into different TCP streams according to the five-tuple of {source address, destination address, source port, destination port, transport layer protocol type}, so that the traffic data set becomes a collection of TCP streams ; Among them, the basis for judging the head of the TCP flow can be but not limited to the Setup, Setup/ACK, and ACK data packets in the TCP flow; and the data packets in a data flow must be arranged in the order of reaching the observation point. Use load analysis and other methods to obtain the protocol type of TCP data flow in an offline manner, such as WWW, MAIL, FTP, P2P, etc.

S102：提取前期TCP流集合的若干个包特征。S102: Extract several packet features of the previous TCP flow set.

提取每一条TCP流中关于数据包的特征，即提取每条流头部10个包的包长、包间隔时间以及包传送方向。在本步骤，提取的每条流的包特征有三大类，一类是包长属性，一类是包间隔属性、另一类是包方向。也就是第一个包的包长、第二个包的包长、第三个包的包长……第十个包的包长；第一个包与第零个包的间隔时间(为0)、第二个包与第一个包的间隔时间、第三个包与第二个包的间隔时间…….第十个包与第九个包的间隔时间；第一个包的方向、第二个包的方向、第三个包的方向……第十个包的方向。并按照数据包在该TCP流的先后顺序构建初步特征序列，然后利用特征选择算法对包特征进行筛选，得到最终特征序列。Extract the characteristics of data packets in each TCP flow, that is, extract the packet length, packet interval time and packet transmission direction of the 10 packets at the head of each flow. In this step, there are three types of extracted packet features of each flow, one is the packet length attribute, the other is the packet interval attribute, and the other is the packet direction. That is, the packet length of the first packet, the packet length of the second packet, the packet length of the third packet...the packet length of the tenth packet; the interval time between the first packet and the zeroth packet (0 ), the interval between the second packet and the first packet, the interval between the third packet and the second packet... the interval between the tenth packet and the ninth packet; the direction of the first packet, The direction of the second bag, the direction of the third bag...the direction of the tenth bag. A preliminary feature sequence is constructed according to the order of the data packets in the TCP flow, and then the feature selection algorithm is used to screen the packet features to obtain the final feature sequence.

S103：建立决策树分类模型。S103: Establish a decision tree classification model.

利用高级编程语言如Java、Matlab等或直接利用机器学习Weka软件，根据前期真实网络数据流量，基于经典的C4.5算法建立决策树模型。Use high-level programming languages such as Java, Matlab, etc. or directly use machine learning Weka software to establish a decision tree model based on the classic C4.5 algorithm according to the real network data traffic in the early stage.

S104：对决策树进行结构转换并存储到FPGA的RAM中。S104: Transform the structure of the decision tree and store it in the RAM of the FPGA.

通过对决策树的遍历，一方面提取决策树的中间节点值，即对所有的路径进行从上到下的遍历及中间节点值的提取。一条路径上可能有基于不同属性的中间节点，同一种属性的中间节点也可能存在于不同的路径上。对一棵树的所有中间节点值进行提取后汇总结果，然后对同一属性的各中间节点值进行从小到大的排序，然后对所有属性的各个中间节点值按顺序进行从小到大的编码。例如，若包长属性的中间节点值从小到大分别为70、80、90、100、110，编码位数为3，则中间节点值70编码为000，中间节点值80编码为001，中间节点值90编码为010，中间节点值100编码为011，中间节点值110编码为100；包间隔属性及包方向属性中间节点值的编码与此类似。另一方面提取决策树的边缘节点值，对边缘节点值同样也进行编码，边缘节点值的编码是一个范围，取决于到达该边缘节点的路径所经历的各中间节点的编码值。中间节点值及其编码以及边缘节点值及其编码分别存储在两块分离的RAM中。其中在流量分类装置用于网络流量在线分类之前，以离线处理的方式建立决策树并对决策树进行数据结构转换。Through the traversal of the decision tree, on the one hand, the intermediate node values of the decision tree are extracted, that is, all paths are traversed from top to bottom and the intermediate node values are extracted. There may be intermediate nodes based on different attributes on a path, and intermediate nodes of the same attribute may also exist on different paths. Extract all intermediate node values of a tree and summarize the results, then sort the intermediate node values of the same attribute from small to large, and then encode the intermediate node values of all attributes in ascending order. For example, if the intermediate node values of the packet length attribute are 70, 80, 90, 100, and 110 from small to large, and the number of encoding digits is 3, then the intermediate node value 70 is encoded as 000, the intermediate node value 80 is encoded as 001, and the intermediate node value The value 90 is encoded as 010, the intermediate node value 100 is encoded as 011, and the intermediate node value 110 is encoded as 100; the encoding of the intermediate node value of the packet interval attribute and packet direction attribute is similar. On the other hand, the edge node value of the decision tree is extracted, and the edge node value is also encoded. The encoding of the edge node value is a range, which depends on the encoding value of each intermediate node experienced by the path to the edge node. Intermediate node values and their codes and edge node values and their codes are stored in two separate RAMs. Wherein, before the traffic classification device is used for online classification of network traffic, a decision tree is established in an off-line processing manner and data structure conversion is performed on the decision tree.

两块RAM的存储结构如表1、表2所示。表中一行为一个记录，每一个记录存储一个中间节点值及该中间节点值的编码值。其中，n1表示第1个属性的中间节点总数，n2表示第2个属性的中间节点总数，nk表示第k个属性的中间节点总数，m表示协议类型的总数。The storage structures of the two RAMs are shown in Table 1 and Table 2. A row in the table is a record, and each record stores an intermediate node value and an encoded value of the intermediate node value. Among them, n1 represents the total number of intermediate nodes of the first attribute, n2 represents the total number of intermediate nodes of the second attribute, nk represents the total number of intermediate nodes of the k-th attribute, and m represents the total number of protocol types.

表1中间节点值及其编码值在RAM中的存储方式Table 1 Storage method of intermediate node value and its encoded value in RAM

中间节点值intermediate node value 中间节点编码值Intermediate node code value 第1个属性第1个中间节点值The value of the first intermediate node of the first attribute 第1个属性第1个中间节点编码值The encoding value of the first intermediate node of the first attribute 第1个属性第2个中间节点值The value of the first attribute and the second intermediate node 第1个属性第2个中间节点编码值The encoding value of the first attribute and the second intermediate node …... …... 第1个属性第n1个中间节点值The first attribute is the value of the n1th intermediate node 第1个属性第n1个中间节点编码值The encoding value of the n1th intermediate node of the first attribute 第2个属性第1个中间节点值The value of the first intermediate node of the second attribute 第2个属性第1个中间节点编码值The encoding value of the first intermediate node of the second attribute 第2个属性第2个中间节点值The value of the second intermediate node of the second attribute 第2个属性第2个中间节点编码值The code value of the second intermediate node of the second attribute …... …... 第2个属性第n2个中间节点值The second attribute is the value of the n2th intermediate node 第2个属性第n2个中间节点编码值The second attribute is the code value of the n2th intermediate node ………………………………………… ………………………..……………………….. 第k个属性第1个中间节点值The first intermediate node value of the kth attribute 第k个属性第1个中间节点编码值The code value of the first intermediate node of the kth attribute 第k个属性第2个中间节点值The second intermediate node value of the kth attribute 第k个属性第2个中间节点编码值The encoding value of the second intermediate node of the kth attribute …... …... 第k个属性第nk个中间节点值The kth attribute is the value of the nkth intermediate node 第k个属性第nk个中间节点编码值The encoding value of the kth attribute and the nkth intermediate node

表2节点值及其编码值在RAM中的存储方式Table 2 The storage method of node value and its encoded value in RAM

节点值(即类别值)Node value (i.e. category value) 节点编码值Node code value 第1种协议类型值The first protocol type value 第1种协议类型编码值范围The first protocol type encoding value range 第2种协议类型值The second protocol type value 第2种协议类型编码值范围The second protocol type encoding value range ………………………………………… …………………….……………………. 第m种协议类型值The value of the mth protocol type 第m种协议类型编码值范围Range of encoding values for the mth protocol type

S105：对待分类的数据包进行分流S105: Distribute the data packets to be classified

根据{源地址、目的地址、源端口、目的端口、传输层协议类型}五元组将数据包划分到不同的流并维护流信息表，流信息表用于记录流的五元组信息以及该条流的类别。在此，仅对语义完整的TCP流进行分析。以TCP的3次握手为流的开始，以TCP的FIN＝1或RST＝1作为流的结束。根据网络中报文的五元组信息{源地址、目的地址、源端口、目的端口、传输层协议类型}判断是否为一条流。若五元组相同，则属于同一个流。否则，为不同的流。其中，若两个包的源地址相同，则属于同向网络流；若源地址与目的地址相同，则属于反向网络流；并约定，以第一个报文的转发方向为该网络流的上行方向。此外，若两个报文间隔超过一定时间，则属于不同的网络流。流信息表中每一条记录包括如下内容：标识一条流的ID、{源地址、目的地址、源端口、目的端口、传输层协议类型}五元组、识别出来的协议类型。流信息表仅需要保存已分类的流的记录，不需要保存未分类的流的记录，因此对流信息表进行查找时，若不存在记录则可以立即判断为未分类，从而节省查找时间。According to the five-tuple of {source address, destination address, source port, destination port, transport layer protocol type}, the data packet is divided into different flows and the flow information table is maintained. The flow information table is used to record the five-tuple information of the flow and the The category of the stream. Here, only semantically complete TCP streams are analyzed. The flow begins with TCP's 3-way handshake, and ends with TCP's FIN=1 or RST=1. According to the five-tuple information {source address, destination address, source port, destination port, transport layer protocol type} of the packet in the network, it is judged whether it is a flow. If the quintuples are the same, they belong to the same stream. Otherwise, a different stream. Among them, if the source addresses of the two packets are the same, they belong to the same network flow; if the source address and the destination address are the same, they belong to the reverse network flow; and it is agreed that the forwarding direction of the first message is the network flow up direction. In addition, if the interval between two packets exceeds a certain period of time, they belong to different network flows. Each record in the flow information table includes the following contents: ID for identifying a flow, {source address, destination address, source port, destination port, transport layer protocol type} five-tuple, and the identified protocol type. The flow information table only needs to save the records of classified flows, not the records of unclassified flows. Therefore, when searching the flow information table, if there is no record, it can be judged as unclassified immediately, thus saving search time.

S106：判断该数据包所属的TCP流是否已分类S106: Determine whether the TCP stream to which the data packet belongs has been classified

利用S105提取的数据包的五元组信息，对流信息表进行查找，看表中是否已存在该五元组代表的流所对应的记录，如果存在记录，则读出该条流的类别值，如果不存在记录，则该条流未被分类。Utilize the quintuple information of the data packet that S105 extracts, search the flow information table, see whether the record corresponding to the flow that the quintuple represents already exists in the table, if there is a record, then read the category value of this flow, If no record exists, the flow is not classified.

S107：对已分类的数据包打上正确标签S107: Labeling the classified data packets correctly

利用步骤S106获取的类别信息对所有经过的数据包进行打标签处理，若数据包所属的流已经被分类，则打上相应的类别标签，分类结束。Use the category information acquired in step S106 to label all passing data packets. If the flow to which the data packet belongs has been classified, label the corresponding category, and the classification ends.

S108：对未分类的数据包打上默认标签并提取待分类TCP流的包特征S108: Put a default label on the unclassified data packet and extract the packet characteristics of the TCP flow to be classified

对于未分类的数据包，按照一定的原则标记一个默认的标签，然后判断该数据包是否需要被提取包特征并做相应处理。在这里，包特征的提取与S102中采用的最终特征序列相对应，即提取第某个包或某些包的某个属性或某些属性是与S102中的最终特征序列相一致的，需要按包到达顺序进行提取，并构建待分类流的特征序列。与流信息表类似，流量信息提取模块也要维护一张参数表，参数表中每一条记录包括如下内容：标识一条流的ID、源地址、目的地址、源端口、目的端口、传输层协议类型}五元组、某个包与前一个包的间隔时间、某个包的包长、某个包的包方向、该条流参数是否已满标志。For unclassified data packets, mark a default label according to certain principles, and then judge whether the data packet needs to be extracted and processed accordingly. Here, the extraction of package features corresponds to the final feature sequence used in S102, that is, the extraction of a certain attribute or certain attributes of the first package or some packages is consistent with the final feature sequence in S102, and needs to be pressed according to Packet arrival sequence is extracted, and the feature sequence of the flow to be classified is constructed. Similar to the flow information table, the traffic information extraction module also maintains a parameter table. Each record in the parameter table includes the following content: ID, source address, destination address, source port, destination port, transport layer protocol type identifying a flow } quintuple, the interval between a certain packet and the previous packet, the packet length of a certain packet, the packet direction of a certain packet, and whether the flow parameter is full or not.

网络数据(即数据包传输过程中的帧)传输是不受影响的，原因在于流量信息提取模块像一颗数据探针，仅仅将路过该模块的参数信息拷贝出来，而不改变任何数据以及数据的传输时序。The transmission of network data (that is, frames in the process of data packet transmission) is not affected, because the traffic information extraction module is like a data probe, which only copies the parameter information passing through the module without changing any data and data transmission timing.

S109，决策树查找。S109, decision tree search.

利用S108所得的待分类流的特征序列对S104所得的两块RAM进行查找，判断该TCP流的类别值并更新流信息表。在查找过程中采用并行处理策略，仅需要两个时钟周期即可完成决策树的查找过程。即第一个时钟周期并行比较所有属性的所有中间节点值，确定该流所属的所有中间节点编码值。也就是说，第一个时钟周期需要完成第1个属性的n1个中间节点值的比较以确定第1个属性所属的中间节点值范围区间，完成第2个属性的n2个中间节点值的比较以确定第2个属性所属的中间节点值范围区间……完成第k个属性的nk个中间节点值的比较以确定第k个属性所属的边缘节点值范围区间，而这n1+n2+…+nk个比较器是同时并行开始执行的。第一个时钟周期结束后，即可确定该流所属的所有属性的中间节点范围，通过RAM中的记录同时可以确定该流所属的所有中间节点编码值，其中一个属性对应一个中间节点编码值，则一条流对应k个中间节点编码值，将这k个中间节点编码值合并为一个数据，第二个时钟周期利用前一个时钟周期的合并结果数据并行比较所有边缘节点编码值，从而确定该流的边缘节点值，也就是协议类别值。Use the feature sequence of the flow to be classified obtained in S108 to search the two blocks of RAM obtained in S104, determine the category value of the TCP flow and update the flow information table. The parallel processing strategy is adopted in the search process, and only two clock cycles are needed to complete the search process of the decision tree. That is, in the first clock cycle, all intermediate node values of all attributes are compared in parallel, and all intermediate node encoding values to which the stream belongs are determined. That is to say, the first clock cycle needs to complete the comparison of n1 intermediate node values of the first attribute to determine the range of intermediate node values to which the first attribute belongs, and complete the comparison of n2 intermediate node values of the second attribute To determine the value range interval of the intermediate node to which the second attribute belongs...Complete the comparison of nk intermediate node values of the kth attribute to determine the value range interval of the edge node to which the kth attribute belongs, and this n1+n2+...+nk Comparators are executed in parallel at the same time. After the end of the first clock cycle, the intermediate node ranges of all the attributes to which the stream belongs can be determined, and the code values of all intermediate nodes to which the stream belongs can be determined at the same time through the records in RAM, one of which corresponds to an intermediate node code value, Then a flow corresponds to k intermediate node encoding values, and the k intermediate node encoding values are combined into one data, and the second clock cycle uses the merged result data of the previous clock cycle to compare all edge node encoding values in parallel to determine the stream The edge node value of , that is, the protocol category value.

图2为本发明所提供的流量分类装置的结构示意图。Fig. 2 is a schematic structural diagram of the flow classification device provided by the present invention.

从功能上看，该流量分类装置可以分为在线和离线两个部分。离线部分主要完成决策树的构造及数据结构转换；在线部分主要负责未知数据流的分类。离线部分包括顺序连接的前期数据流量采集模块201、前期数据流分流模块202、前期数据流人工分类模块203、前期数据流特征提取模块204、决策树建树模块205、决策树结构转换模块206以及后期的分类结果处理模块207；在线部分包括顺序连接的MAC层处理模块一211、数据包轮询管理模块212、分流判断模块213、流量信息提取及打标签模块214、决策树查找模块215、MAC层处理模块二216。From a functional point of view, the flow classification device can be divided into two parts: online and offline. The offline part mainly completes the construction of the decision tree and the data structure conversion; the online part is mainly responsible for the classification of unknown data streams. The offline part includes sequentially connected pre-data flow collection module 201, pre-data stream shunting module 202, pre-data stream manual classification module 203, pre-data stream feature extraction module 204, decision tree building module 205, decision tree structure conversion module 206 and later classification result processing module 207; the online part includes sequentially connected MAC layer processing module one 211, data packet polling management module 212, distribution judgment module 213, traffic information extraction and labeling module 214, decision tree search module 215, MAC layer Process Module II 216.

在本流量分类装置中，前期数据流量采集模块201、前期数据流分流模块202、前期数据流人工分类模块203、前期数据流特征提取模块204、决策树建树模块205、决策树结构转换模块206可在装置部署前完成，因此不是使用流量分类的装置或者系统的必要组成部分。而MAC层处理模块一211、数据包轮询管理模块212、分流判断模块213、流量信息提取及打标签模块214、决策树查找模块215、MAC层处理模块二216、分类结果处理模块207一般应在使用流量分类的装置或系统中出现。In this traffic classification device, the pre-data flow collection module 201, the pre-data stream distribution module 202, the pre-data stream manual classification module 203, the pre-data stream feature extraction module 204, the decision tree building module 205, and the decision tree structure conversion module 206 can be It is done before the device is deployed and therefore is not a required part of the device or system that uses traffic classification. And MAC layer processing module one 211, data packet polling management module 212, distribution judgment module 213, traffic information extraction and labeling module 214, decision tree search module 215, MAC layer processing module two 216, classification result processing module 207 generally should Occurs in devices or systems that use traffic classification.

每一模块具体功能和处理流程如下：在带有流量分类的装置或系统使用前，需要使用前期数据流量采集模块201、前期数据流分流模块202、前期数据流人工分类模块203、、前期数据流特征提取模块204、决策树建树模块205、决策树结构转换模块206完成图1中S101～S104的工作，形成的经过转换的决策树数据结构置于装置中的RAM中。当一个未知类别的数据包进入流量分类装置后，而MAC层处理模块一211、数据包轮询管理模块212对数据包进行预处理，分流判断模块213根据{源地址、目的地址、源端口、目的端口、传输层协议类型}五元组将数据包划分到不同的流并维护流信息表，然后对流信息表进行查找以确定数据包所属的类别，完成图1中S105～S106的工作。流量信息提取及打标签模块214根据分流判断模块213所获取的类别信息对数据包进行打标签处理，同时按数据包先后顺序，依次提取包长、修正包间隔时间、传送方向等参数，形成特征序列，送入决策树查找模块215、完成图1中S107～S109的工作。流信息表仅由决策树查找模块215进行更新处理，其他模块均不能对流信息表进行写操作。MAC层处理模块二216及分类结果处理模块207对数据包进行后续的处理并显示分类结果。The specific functions and processing flow of each module are as follows: Before the device or system with traffic classification is used, it is necessary to use the early data flow collection module 201, the early data flow distribution module 202, the early data flow manual classification module 203, and the early data flow The feature extraction module 204, the decision tree building module 205, and the decision tree structure conversion module 206 complete the work of S101-S104 in FIG. 1, and the formed converted decision tree data structure is placed in the RAM of the device. After a data packet of an unknown category enters the traffic classification device, the MAC layer processing module one 211 and the data packet polling management module 212 preprocess the data packet, and the distribution judging module 213 is based on {source address, destination address, source port, Destination port, transport layer protocol type} quintuple divides the data packet into different flows and maintains the flow information table, and then searches the flow information table to determine the category to which the data packet belongs, and completes the work of S105-S106 in Figure 1. The traffic information extraction and labeling module 214 performs labeling processing on the data packets according to the type information obtained by the distribution judging module 213, and simultaneously extracts parameters such as packet length, corrected packet interval time, and transmission direction according to the sequence of the data packets to form a feature The sequence is sent to the decision tree search module 215 to complete the work of S107-S109 in FIG. 1 . The flow information table is only updated by the decision tree search module 215, and other modules cannot write to the flow information table. The second MAC layer processing module 216 and the classification result processing module 207 perform subsequent processing on the data packet and display the classification result.

本实施例提供的方法和装置，对采用C4.5算法建立的决策树进行了数据结构转换，使之转换成一种易于硬件实现的数据结构，降低了算法本身的复杂度；决策树查找过程中使用了并行查找和流水线技术，提高了处理速度；选取的包特征提取过程简单，易于在线完成；利用了C4.5算法本身所具有的准确度高、稳定性好的特点。因此，本实施例可以方便的实现网络流量的高速在线分类。The method and device provided by this embodiment convert the data structure of the decision tree established by the C4.5 algorithm, so that it is converted into a data structure that is easy to implement in hardware, reducing the complexity of the algorithm itself; Using parallel search and pipeline technology, the processing speed is improved; the feature extraction process of the selected package is simple and easy to complete online; the C4.5 algorithm itself has been used for its high accuracy and good stability. Therefore, this embodiment can conveniently implement high-speed online classification of network traffic.

最后应说明的是：以上实施例仅用以说明本发明的技术方案及装置，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解；其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions and devices of the present invention, rather than to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand; It is still possible to modify the technical solutions described in the foregoing embodiments, or to perform equivalent replacements on the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the various embodiments of the present invention. .