Movatterモバイル変換


[0]ホーム

URL:


CN102523241B - Method and device for online classification of network traffic based on high-speed parallel processing of decision trees - Google Patents

Method and device for online classification of network traffic based on high-speed parallel processing of decision trees
Download PDF

Info

Publication number
CN102523241B
CN102523241BCN201210006268.7ACN201210006268ACN102523241BCN 102523241 BCN102523241 BCN 102523241BCN 201210006268 ACN201210006268 ACN 201210006268ACN 102523241 BCN102523241 BCN 102523241B
Authority
CN
China
Prior art keywords
decision tree
flow
module
data
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210006268.7A
Other languages
Chinese (zh)
Other versions
CN102523241A (en
Inventor
顾仁涛
许艳红
纪越峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and TelecommunicationsfiledCriticalBeijing University of Posts and Telecommunications
Priority to CN201210006268.7ApriorityCriticalpatent/CN102523241B/en
Publication of CN102523241ApublicationCriticalpatent/CN102523241A/en
Application grantedgrantedCritical
Publication of CN102523241BpublicationCriticalpatent/CN102523241B/en
Expired - Fee Relatedlegal-statusCriticalCurrent
Anticipated expirationlegal-statusCritical

Links

Landscapes

Abstract

The invention relates to a method and a device for classifying network traffic on line based on decision tree high-speed parallel processing. The method comprises the following steps of: performing acquisition, distribution and manual classification on early real traffic data, extracting the packet characteristics of an early transmission control protocol (TCP) stream set, establishing a decision tree classification model, converting a data structure, performing distribution and class judgment on a data packet to be classified, tagging a current data packet, extracting the packet characteristics of a TCP stream to be classified, and searching for a decision tree. The device comprises a decision tree construction module, a structure conversion module, a classification result processing module, a medium access control (MAC) layer processing module, a data packet polling management module, a distribution judgment module, a traffic information extraction and tagging module, and a decision tree searching module. The method and the device are low in algorithm complexity and high in processing speed, classification accuracy and stability, and can be used for equipment and systems with requirements for online traffic classification in a high speed backbone network; and online classification can be realized.

Description

Translated fromChinese
基于决策树高速并行处理的网络流量在线分类方法及装置Method and device for online classification of network traffic based on high-speed parallel processing of decision trees

技术领域technical field

本发明涉及一种网络流量在线分类的方法及装置,尤其涉及一种基于决策树高速并行处理策略实现TCP流量在线分类的方法及装置,属于通信技术领域。The invention relates to a method and device for online classification of network traffic, in particular to a method and device for realizing online classification of TCP traffic based on a high-speed parallel processing strategy of a decision tree, and belongs to the field of communication technology.

背景技术Background technique

如今,网络技术的发展越来越迅速,基于网络的应用越来越多、越来越复杂。各种各样的应用不但抢占着越来越多的网络资源,而且也对QoS和网络安全带来了巨大的威胁。在这样的背景下,如何给广大的互联网用户提供一个安全、可靠、高效的使用环境,如何发现并避免网络的异常流量,是网络管理领域需要解决的重要问题。为了解决上述这些问题,网络研究人员提出了流量调度、容量规划等一系列策略来提高网络的运营效率。然而,无论是对现有网络进行扩容改造,还是进行QoS调度,都必须对网络流量中的各种应用(如P2P、Web、IM、视频流量等)进行准确的分类与识别。此外,在网络安全、流量计费、应用趋势分析等研究领域,准确的流量分类也具有极其重要的意义。有线宽带和3G/4G的迅速推广,使得流量分类这一有效进行网络精细化管理的工具更有广阔的应用前景。Nowadays, network technology is developing more and more rapidly, and more and more network-based applications are becoming more and more complex. Various applications not only occupy more and more network resources, but also pose a huge threat to QoS and network security. Under such a background, how to provide a safe, reliable and efficient environment for the majority of Internet users, and how to discover and avoid abnormal network traffic are important issues to be solved in the field of network management. In order to solve the above problems, network researchers have proposed a series of strategies such as traffic scheduling and capacity planning to improve network operation efficiency. However, it is necessary to accurately classify and identify various applications in network traffic (such as P2P, Web, IM, video traffic, etc.) whether it is expanding the existing network or performing QoS scheduling. In addition, accurate traffic classification is also of great significance in research fields such as network security, traffic billing, and application trend analysis. The rapid promotion of wired broadband and 3G/4G makes traffic classification, a tool for effective fine-grained network management, more broad application prospects.

传统的流量分类技术主要基于传输层的端口信息,然而近年来,在互联网网络带宽不断提高以及应用层协议逐渐复杂多样的趋势下,许多网络应用与端口的相关性越来越小,伪装端口以及动态端口等情况使得上述方法已经很难适应技术和应用的发展与需求,这就迫切需要引入新的理论和技术,深层次挖掘网络应用的内在特征。为了适应Internet流量数据庞大、应用属性动态变化的特点,利用机器学习方法处理流量分类问题成为当前网络测量领域内一个新兴的研究热点。例如:朴素贝叶斯算法、改进贝叶斯算法、决策树算法、KNN算法、支持向量机算法、神经网络算法以及各种聚类算法等等。基于机器学习的流量分类技术不依赖于传输层端口号或解析有效负载来识别网络应用,而是利用流量在传输过程中表现出来的流的各种统计特征如包长、包间隔时间等来识别网络应用,方法本身不受伪装端口、动态端口、有效负载加密甚至网络地址转换的影响,在分类性能和灵活性方面,较之前述各种方法都有所突破。The traditional traffic classification technology is mainly based on the port information of the transport layer. However, in recent years, with the continuous improvement of Internet network bandwidth and the trend of increasingly complex and diverse application layer protocols, many network applications have less and less correlation with ports. Masquerading ports and Dynamic ports and other situations make it difficult for the above methods to adapt to the development and needs of technology and applications, which urgently requires the introduction of new theories and technologies to deeply explore the inherent characteristics of network applications. In order to adapt to the characteristics of huge Internet traffic data and dynamic changes of application attributes, using machine learning methods to deal with traffic classification problems has become an emerging research hotspot in the field of network measurement. For example: naive Bayesian algorithm, improved Bayesian algorithm, decision tree algorithm, KNN algorithm, support vector machine algorithm, neural network algorithm and various clustering algorithms, etc. Machine learning-based traffic classification technology does not rely on transport layer port numbers or parsing payloads to identify network applications, but uses various statistical characteristics of the flow shown during traffic transmission, such as packet length, packet interval time, etc. For network applications, the method itself is not affected by masquerading ports, dynamic ports, payload encryption or even network address translation, and has breakthroughs in classification performance and flexibility compared with the above-mentioned methods.

然而,目前业界对流量分类技术的研究还远远无法满足业务发展的需求,主要体现在目前大多数技术都采用离线分类的手段,无法实现实时在线的分类。这就限制了流量分类技术在高速骨干网中的应用。However, the current research on traffic classification technology in the industry is still far from meeting the needs of business development. This is mainly reflected in the fact that most current technologies use offline classification methods, which cannot realize real-time online classification. This limits the application of traffic classification technology in high-speed backbone networks.

为了满足目前和未来高速骨干网的需要,流量分类技术迫切需要满足以下几点要求:1)分类准确性较高,避免采用端口或者净荷作为主要识别特征;2)算法复杂度较低,具体实现设计上要有并行化处理的特性,易于硬件实现(如FPGA、CPLD、ASIC等),保证网络流量的高速在线分类;3)分类稳定性较好,能够适用于复杂多变的网络环境。In order to meet the needs of current and future high-speed backbone networks, the traffic classification technology urgently needs to meet the following requirements: 1) The classification accuracy is high, and ports or payloads are avoided as the main identification features; 2) The algorithm complexity is low, and the specific The implementation design must have the characteristics of parallel processing, easy to implement in hardware (such as FPGA, CPLD, ASIC, etc.), and ensure high-speed online classification of network traffic; 3) The classification stability is good, and it can be applied to complex and changeable network environments.

发明内容Contents of the invention

本发明提供了一种基于决策树高速并行处理策略实现TCP流量在线分类的方法及装置,能够实现网络流量的高速实时在线分类,稳定性好,准确性高。The invention provides a method and device for realizing TCP flow online classification based on a decision tree high-speed parallel processing strategy, which can realize high-speed real-time online classification of network flow, and has good stability and high accuracy.

为实现上述的发明目的,本发明采用下述的技术方案:For realizing above-mentioned purpose of the invention, the present invention adopts following technical scheme:

一种基于决策树高速并行处理策略实现TCP流量在线分类的方法,其特征在于包括以下步骤:A method for realizing TCP traffic online classification based on decision tree high-speed parallel processing strategy, is characterized in that comprising the following steps:

步骤1,前期真实流量数据的采集、分流及手工分类:采集网络真实流量数据集,利用五元组将数据集分离为不同的TCP流,对TCP流的集合进行手工分类,使每一条TCP流都与一种协议类型相对应。Step 1, the collection, distribution and manual classification of real traffic data in the early stage: collect the real traffic data set of the network, use quintuple to separate the data set into different TCP streams, and manually classify the collection of TCP streams, so that each TCP stream Both correspond to a protocol type.

步骤2,提取前期TCP流集合的若干个包特征:提取每一条TCP流中关于数据包的特征,并按照数据包在该TCP流的先后顺序构建初步特征序列,然后根据特征选择算法对初步提取的包特征进行处理,筛选出最能体现流类别特性的包特征并形成最终特征序列。Step 2, extract several packet features of the previous TCP flow set: extract the characteristics of the data packets in each TCP flow, and construct a preliminary feature sequence according to the order of the data packets in the TCP flow, and then use the feature selection algorithm to initially extract The packet features are processed, and the packet features that best reflect the characteristics of the flow category are screened out to form the final feature sequence.

步骤3,决策树分类模型的建立:对步骤2构成的最终特征序列,利用决策树算法进行建树。Step 3, establishment of a decision tree classification model: use the decision tree algorithm to build a tree for the final feature sequence formed in step 2.

步骤4,对步骤3中建立的决策树进行数据结构转换并存储到硬件设备(如FPGA、CPLD、ASIC等)的存储设备(如RAM、ROM、FLASH等)中:通过对决策树的遍历,一方面提取决策树的中间节点值,对同一属性的各中间节点值进行从小到大的排序,然后对所有属性的各个中间节点值按顺序进行从小到大的编码,另一方面提取决策树的边缘节点值,对边缘节点值同样也进行编码,边缘节点值的编码是一个范围,取决于到达该边缘节点所经历的各中间节点的编码值。中间节点值及其编码以及边缘节点值及其编码分别存储在两块分离的存储设备(如RAM、ROM、FLASH等)中。其中在流量分类装置用于网络流量在线分类之前,以离线处理的方式建立决策树并对决策树进行数据结构转换。Step 4, convert the data structure of the decision tree established in step 3 and store it in the storage device (such as RAM, ROM, FLASH, etc.) of the hardware device (such as FPGA, CPLD, ASIC, etc.): by traversing the decision tree, On the one hand, extract the intermediate node values of the decision tree, sort the intermediate node values of the same attribute from small to large, and then encode the intermediate node values of all attributes in order from small to large, and on the other hand extract the decision tree. The edge node value also encodes the edge node value, and the encoding of the edge node value is a range, which depends on the encoding value of each intermediate node that reaches the edge node. Intermediate node values and their codes and edge node values and their codes are stored in two separate storage devices (such as RAM, ROM, FLASH, etc.). Wherein, before the traffic classification device is used for online classification of network traffic, a decision tree is established in an off-line processing manner and data structure conversion is performed on the decision tree.

步骤5,对待分类的数据包进行分流及类别判断:根据五元组将数据包划分到不同的流并查找流信息表获取分类信息,流信息表用于记录流的五元组信息以及该条流的类别。流信息表仅需要保存已分类的流的记录,不需要保存未分类的流的记录,因此对流信息表进行查找时,若不存在记录则可以立即判断为未分类,从而节省查找时间。Step 5, divide the data packets to be classified and judge the category: divide the data packets into different flows according to the five-tuple and search the flow information table to obtain the classification information. The flow information table is used to record the five-tuple information of the flow and the The class of the stream. The flow information table only needs to save the records of classified flows, not the records of unclassified flows. Therefore, when searching the flow information table, if there is no record, it can be judged as unclassified immediately, thus saving search time.

步骤6,对当前数据包进行打标签处理并提取待分类TCP流的包特征:利用步骤5提取的类别信息对所有经过的数据包进行打标签处理,若数据包所属的流已经被分类,则打上相应的类别标签,若未分类,则按照一定的原则标记一个默认的标签,然后判断该数据包是否需要被提取包特征并做相应处理。在这里,包特征的提取与步骤2中采用的最终特征序列相对应,需要按包到达顺序进行提取,并构建待分类流的特征序列,包的顺序信息按照到达观测点的时间顺序进行排列,取三次握手的第一个请求包即Setup包作为该流的第一个包。待分类流的特征序列存储在参数表中,参数表的一条记录包括五元组、各个包特征值以及参数是否满的标志。已分类被打上正确标签的包与未分类打上默认标签且需要进行参数提取的包需要进行时钟同步处理,即插入相应级别的流水线以保证数据包传输路线上的FIFO不会出现溢出现象。Step 6: Tag the current data packet and extract the packet characteristics of the TCP flow to be classified: use the category information extracted in step 5 to tag all passing data packets, if the flow to which the data packet belongs has been classified, then Label the corresponding category. If it is not classified, mark a default label according to certain principles, and then judge whether the data packet needs to be extracted and processed accordingly. Here, the extraction of packet features corresponds to the final feature sequence used in step 2. It needs to be extracted according to the order of packet arrival, and the feature sequence of the flow to be classified is constructed. Take the first request packet of the three-way handshake, that is, the Setup packet as the first packet of the flow. The characteristic sequence of the flow to be classified is stored in the parameter table, and a record of the parameter table includes five tuples, characteristic values of each packet, and a flag indicating whether the parameter is full. Packets that have been classified and tagged correctly and unclassified packets that are tagged with a default tag and need to be parameterized need to be clock-synchronized, that is, a corresponding level of pipeline is inserted to ensure that the FIFO on the data packet transmission route does not overflow.

步骤7,决策树查找:利用步骤6所得的待分类流的特征序列对步骤4所得的两块存储设备(如RAM、ROM、FLASH等)进行查找,判断该TCP流的类别值并更新流信息表。在查找过程中采用并行处理策略,仅需要两个时钟周期即可完成决策树的查找过程。即第一个时钟周期并行比较所有属性的所有中间节点值,确定该流所属的所有中间节点编码值并合并为一个数据,第二个时钟周期利用前一个时钟周期的结果数据并行比较所有边缘节点的编码值,从而确定该流的类别。Step 7, decision tree search: use the characteristic sequence of the flow to be classified obtained in step 6 to search the two storage devices (such as RAM, ROM, FLASH, etc.) obtained in step 4, judge the category value of the TCP flow and update the flow information surface. The parallel processing strategy is adopted in the search process, and only two clock cycles are needed to complete the search process of the decision tree. That is, the first clock cycle compares all intermediate node values of all attributes in parallel, determines the encoding values of all intermediate nodes to which the flow belongs and merges them into one data, and uses the result data of the previous clock cycle to compare all edge nodes in parallel in the second clock cycle The encoded value of the stream to determine the class of the stream.

一种基于硬件设备(如FPGA、CPLD、ASIC等)的流量在线分类装置,用于实现上述的利用决策树高速并行处理策略实现TCP流量在线分类的装置包括在线部分和离线部分,其中离线部分具有决策树建树模块、决策树结构转换模块、分类结果处理模块;在线部分包括顺序连接的MAC层处理模块一、数据包轮询管理模块、分流判断模块、流量信息提取及打标签模块、决策树查找模块、MAC层处理模块二。其中,离线部分的决策树结构转换模块与在线部分的决策树查找模块相连接,离线部分的分类结果处理模块与在线部分的决策树查找模块通过一个流信息表间接连接。在线部分采用流水线处理技术。A kind of traffic online classification device based on hardware equipment (such as FPGA, CPLD, ASIC etc.), is used to realize above-mentioned utilizing decision tree high-speed parallel processing strategy to realize the device of TCP traffic online classification including online part and offline part, wherein offline part has Decision tree building module, decision tree structure conversion module, classification result processing module; the online part includes sequentially connected MAC layer processing module 1, data packet polling management module, distribution judgment module, traffic information extraction and labeling module, decision tree search Module, MAC layer processing module two. Among them, the decision tree structure conversion module of the offline part is connected with the decision tree search module of the online part, and the classification result processing module of the offline part is indirectly connected with the decision tree search module of the online part through a flow information table. The online part adopts pipeline processing technology.

决策树建树模块,用于根据前期真实网络数据流量建立决策树模型。The decision tree building module is used to build a decision tree model based on the real network data traffic in the early stage.

决策树结构转换模块,用于将决策树模型的结构进行转换,使之成为易于硬件实现的另一种数据结构并存储到硬件设备(如FPGA、CPLD、ASIC等)的存储设备(如RAM、ROM、FLASH等)中:通过对决策树的遍历,一方面提取决策树的中间节点值,对同一属性的各中间节点值进行从小到大的排序,然后对所有属性的各个中间节点值按顺序进行从小到大的编码,另一方面提取决策树的边缘节点值,对边缘节点值同样也进行编码,边缘节点值的编码是一个范围,取决于到达该边缘节点所经历的各中间节点的编码值。中间节点值及其编码以及边缘节点值及其编码分别存储在两块分离的存储设备(如RAM、ROM、FLASH等)中。The decision tree structure conversion module is used to convert the structure of the decision tree model so that it becomes another data structure that is easy to be implemented by hardware and is stored in a storage device (such as RAM, ROM, FLASH, etc.): By traversing the decision tree, on the one hand, extract the intermediate node values of the decision tree, sort the intermediate node values of the same attribute from small to large, and then sort the intermediate node values of all attributes in order Carry out coding from small to large, on the other hand, extract the edge node value of the decision tree, and also encode the edge node value. The coding of the edge node value is a range, which depends on the coding of each intermediate node experienced by reaching the edge node value. Intermediate node values and their codes and edge node values and their codes are stored in two separate storage devices (such as RAM, ROM, FLASH, etc.).

MAC层处理模块如今有许多现成的IP核可以采用。There are many off-the-shelf IP cores that can be used in the MAC layer processing module.

数据包轮询管理模块,用于从N个数据包缓冲队列中读取数据包,此处采用轮询式访问每个输入队列,直到从一个队列读完一个完整的数据包才转到下一个队列。The data packet polling management module is used to read data packets from N data packet buffer queues. Here, polling is used to access each input queue until a complete data packet is read from a queue before going to the next one. queue.

分流判断模块,用于根据五元组将数据包划分到不同的流,判断该条流是否被分类,如果被分类,类别值是多少,并维护流信息表。流信息表用于记录流的五元组信息以及该条流的类别。流信息表仅由决策树查找模块进行更新处理,其他模块均不能对流信息表进行写操作。The distribution judging module is used to divide the data packets into different flows according to the quintuple, judge whether the flow is classified, and if so, what is the classification value, and maintain the flow information table. The flow information table is used to record the five-tuple information of the flow and the category of the flow. The flow information table is only updated by the decision tree search module, and other modules cannot write to the flow information table.

流量信息提取及打标签模块,用于对未分类的数据包进行流特征提取并对所有数据包进行打标签处理。该模块需要维护一张参数表,参数表记录各条流的五元组信息、特征值以及特征值是否满的标志。特征值满的流的参数信息被发送到决策树查找模块。The flow information extraction and labeling module is used to extract flow features from unclassified data packets and label all data packets. This module needs to maintain a parameter table, and the parameter table records the quintuple information, eigenvalues and signs indicating whether the eigenvalues are full or not for each stream. The parameter information of the flow with full feature value is sent to the decision tree lookup module.

决策树查找模块,用于利用流量信息提取及打标签模块发送的待分类流的特征序列对决策树结构转换模块所得的两块存储设备(如RAM、ROM、FLASH等)进行查找,判断该TCP流的类别值并更新流信息表。在查找过程中采用并行处理策略,仅需要两个时钟周期即可完成决策树的查找过程。即第一个时钟周期并行查找所有属性的所有中间节点值,确定该流所属的所有中间节点编码值并合并为一个数据,第二个时钟周期利用前一个时钟周期的结果数据并行查找所有边缘节点的编码值,从而确定该流的类别,并将分类结果发送至流信息表以对流信息表中的记录进行更新。The decision tree search module is used to search the two storage devices (such as RAM, ROM, FLASH, etc.) obtained by the decision tree structure conversion module by using the feature sequence of the stream to be classified sent by the flow information extraction and labeling module, and judge the TCP The category value of the flow and update the flow information table. The parallel processing strategy is adopted in the search process, and only two clock cycles are needed to complete the search process of the decision tree. That is, in the first clock cycle, all intermediate node values of all attributes are searched in parallel, and all intermediate node encoding values to which the flow belongs are determined and combined into one data; in the second clock cycle, all edge nodes are searched in parallel using the result data of the previous clock cycle to determine the category of the flow, and send the classification result to the flow information table to update the records in the flow information table.

分类结果处理模块,用于对流量分类的结果进行汇总、处理及界面显示。The classification result processing module is used to summarize, process and display the traffic classification results.

因此,本发明提供的流量分类方法和装置,具有以下优点:对决策树的结构进行了转换,使之转换成一种易于硬件实现的数据结构,降低了算法本身的复杂度;决策树查找过程中使用了并行查找和流水线技术,提高了处理速度;选取的包特征提取过程简单,易于在线完成;利用了决策树本身所具有的准确度高、稳定性好的特点。总之,较低的算法复杂度、高效的硬件在线实现方式、准确稳定的流量分类结果构成了本发明的最大特色。Therefore, the traffic classification method and device provided by the present invention have the following advantages: the structure of the decision tree is converted into a data structure that is easy to implement in hardware, which reduces the complexity of the algorithm itself; The parallel search and pipeline technology are used to improve the processing speed; the feature extraction process of the selected package is simple and easy to complete online; the high accuracy and good stability of the decision tree itself are used. In a word, relatively low algorithm complexity, efficient hardware online implementation, and accurate and stable flow classification results constitute the greatest features of the present invention.

附图说明Description of drawings

为了更清楚地说明本发明,下面将对本发明实施例描述中所需要使用的附图作简单的介绍,显然地,下面描述中的附图仅仅是本发明的流量分类方法流程图、流量分类装置的结构示意图,对于本领域普通技术人员来讲,在不付出创造性劳动前提下,还可以根据这些附图获得的更多的附图。In order to illustrate the present invention more clearly, the accompanying drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the accompanying drawings in the following description are only the flow chart of the flow classification method and the flow classification device of the present invention. The structure schematic diagram of , for those of ordinary skill in the art, more drawings can also be obtained based on these drawings without creative work.

图1是本发明一个实施例提供的流量分类方法流程图;FIG. 1 is a flowchart of a flow classification method provided by an embodiment of the present invention;

图2是本发明一个实施例提供的流量分类装置的结构示意图;Fig. 2 is a schematic structural diagram of a flow classification device provided by an embodiment of the present invention;

具体实施方式Detailed ways

下面将结合本发明实施例的附图,对本发明实施例中的技术方案和装置进行清楚、完整地描述。此具体实施方式以决策树中C4.5算法为例进行详细描述,但是此方法同样适适用于其它决策树算法。此实施例基于FPGA实现,采用的存储设备是RAM,同样适用于其他硬件设备(如FPGA、CPLD、ASIC等)及存储设备(如RAM、ROM、FLASH等)。显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明的保护范围。The following will clearly and completely describe the technical solutions and devices in the embodiments of the present invention with reference to the accompanying drawings of the embodiments of the present invention. This specific implementation is described in detail by taking the C4.5 algorithm in the decision tree as an example, but this method is also applicable to other decision tree algorithms. This embodiment is realized based on FPGA, and the storage device adopted is RAM, which is also applicable to other hardware devices (such as FPGA, CPLD, ASIC, etc.) and storage devices (such as RAM, ROM, FLASH, etc.). Apparently, the described embodiments are only some of the embodiments of the present invention, but not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

图1为本发明一个实施例提供的流量分类方法的流程图,该方法包括:Fig. 1 is a flowchart of a flow classification method provided by an embodiment of the present invention, the method comprising:

S101:在不同时间从不同地点收集多个网络流量数据集,并对数据集进行分流和手工分类。S101: Collect multiple network traffic data sets from different locations at different times, and perform splitting and manual classification on the data sets.

网络流量分类装置一般部署在一定的网络中。决策树算法需要利用网络中的真实流量进行训练,以构建决策树分类模型,因此需要在准备部署的网络中设置网络探针,以便从网络中采集真实流量。上述的真实流量数据集包括用于手工分类确定流量协议类型所需的信息,以及数据包长、包间隔时间、包方向等后续步骤所需的特征参数;从传输层协议上看,数据流中至少包含TCP流和UDP流。需要把流量数据集按照{源地址、目的地址、源端口、目的端口、传输层协议类型}五元组将所得流量数据集分离为不同的TCP流,这样流量数据集就变为了TCP流的集合;其中,TCP流的头部的判断依据可以使用但不限于TCP流中的Setup、Setup/ACK、ACK数据包;并且一个数据流中数据包必须按照达到观测点的先后顺序排列。采用载荷分析等方法,以离线方式获得TCP数据流的协议类型,如WWW、MAIL、FTP、P2P等。The network traffic classification device is generally deployed in a certain network. The decision tree algorithm needs to use real traffic in the network for training to build a decision tree classification model. Therefore, it is necessary to set up network probes in the network to be deployed to collect real traffic from the network. The above-mentioned real traffic data set includes the information needed to manually classify and determine the type of traffic protocol, as well as the characteristic parameters required for subsequent steps such as data packet length, packet interval time, and packet direction; from the perspective of the transport layer protocol, in the data flow Contains at least TCP stream and UDP stream. The traffic data set needs to be separated into different TCP streams according to the five-tuple of {source address, destination address, source port, destination port, transport layer protocol type}, so that the traffic data set becomes a collection of TCP streams ; Among them, the basis for judging the head of the TCP flow can be but not limited to the Setup, Setup/ACK, and ACK data packets in the TCP flow; and the data packets in a data flow must be arranged in the order of reaching the observation point. Use load analysis and other methods to obtain the protocol type of TCP data flow in an offline manner, such as WWW, MAIL, FTP, P2P, etc.

S102:提取前期TCP流集合的若干个包特征。S102: Extract several packet features of the previous TCP flow set.

提取每一条TCP流中关于数据包的特征,即提取每条流头部10个包的包长、包间隔时间以及包传送方向。在本步骤,提取的每条流的包特征有三大类,一类是包长属性,一类是包间隔属性、另一类是包方向。也就是第一个包的包长、第二个包的包长、第三个包的包长……第十个包的包长;第一个包与第零个包的间隔时间(为0)、第二个包与第一个包的间隔时间、第三个包与第二个包的间隔时间…….第十个包与第九个包的间隔时间;第一个包的方向、第二个包的方向、第三个包的方向……第十个包的方向。并按照数据包在该TCP流的先后顺序构建初步特征序列,然后利用特征选择算法对包特征进行筛选,得到最终特征序列。Extract the characteristics of data packets in each TCP flow, that is, extract the packet length, packet interval time and packet transmission direction of the 10 packets at the head of each flow. In this step, there are three types of extracted packet features of each flow, one is the packet length attribute, the other is the packet interval attribute, and the other is the packet direction. That is, the packet length of the first packet, the packet length of the second packet, the packet length of the third packet...the packet length of the tenth packet; the interval time between the first packet and the zeroth packet (0 ), the interval between the second packet and the first packet, the interval between the third packet and the second packet... the interval between the tenth packet and the ninth packet; the direction of the first packet, The direction of the second bag, the direction of the third bag...the direction of the tenth bag. A preliminary feature sequence is constructed according to the order of the data packets in the TCP flow, and then the feature selection algorithm is used to screen the packet features to obtain the final feature sequence.

S103:建立决策树分类模型。S103: Establish a decision tree classification model.

利用高级编程语言如Java、Matlab等或直接利用机器学习Weka软件,根据前期真实网络数据流量,基于经典的C4.5算法建立决策树模型。Use high-level programming languages such as Java, Matlab, etc. or directly use machine learning Weka software to establish a decision tree model based on the classic C4.5 algorithm according to the real network data traffic in the early stage.

S104:对决策树进行结构转换并存储到FPGA的RAM中。S104: Transform the structure of the decision tree and store it in the RAM of the FPGA.

通过对决策树的遍历,一方面提取决策树的中间节点值,即对所有的路径进行从上到下的遍历及中间节点值的提取。一条路径上可能有基于不同属性的中间节点,同一种属性的中间节点也可能存在于不同的路径上。对一棵树的所有中间节点值进行提取后汇总结果,然后对同一属性的各中间节点值进行从小到大的排序,然后对所有属性的各个中间节点值按顺序进行从小到大的编码。例如,若包长属性的中间节点值从小到大分别为70、80、90、100、110,编码位数为3,则中间节点值70编码为000,中间节点值80编码为001,中间节点值90编码为010,中间节点值100编码为011,中间节点值110编码为100;包间隔属性及包方向属性中间节点值的编码与此类似。另一方面提取决策树的边缘节点值,对边缘节点值同样也进行编码,边缘节点值的编码是一个范围,取决于到达该边缘节点的路径所经历的各中间节点的编码值。中间节点值及其编码以及边缘节点值及其编码分别存储在两块分离的RAM中。其中在流量分类装置用于网络流量在线分类之前,以离线处理的方式建立决策树并对决策树进行数据结构转换。Through the traversal of the decision tree, on the one hand, the intermediate node values of the decision tree are extracted, that is, all paths are traversed from top to bottom and the intermediate node values are extracted. There may be intermediate nodes based on different attributes on a path, and intermediate nodes of the same attribute may also exist on different paths. Extract all intermediate node values of a tree and summarize the results, then sort the intermediate node values of the same attribute from small to large, and then encode the intermediate node values of all attributes in ascending order. For example, if the intermediate node values of the packet length attribute are 70, 80, 90, 100, and 110 from small to large, and the number of encoding digits is 3, then the intermediate node value 70 is encoded as 000, the intermediate node value 80 is encoded as 001, and the intermediate node value The value 90 is encoded as 010, the intermediate node value 100 is encoded as 011, and the intermediate node value 110 is encoded as 100; the encoding of the intermediate node value of the packet interval attribute and packet direction attribute is similar. On the other hand, the edge node value of the decision tree is extracted, and the edge node value is also encoded. The encoding of the edge node value is a range, which depends on the encoding value of each intermediate node experienced by the path to the edge node. Intermediate node values and their codes and edge node values and their codes are stored in two separate RAMs. Wherein, before the traffic classification device is used for online classification of network traffic, a decision tree is established in an off-line processing manner and data structure conversion is performed on the decision tree.

两块RAM的存储结构如表1、表2所示。表中一行为一个记录,每一个记录存储一个中间节点值及该中间节点值的编码值。其中,n1表示第1个属性的中间节点总数,n2表示第2个属性的中间节点总数,nk表示第k个属性的中间节点总数,m表示协议类型的总数。The storage structures of the two RAMs are shown in Table 1 and Table 2. A row in the table is a record, and each record stores an intermediate node value and an encoded value of the intermediate node value. Among them, n1 represents the total number of intermediate nodes of the first attribute, n2 represents the total number of intermediate nodes of the second attribute, nk represents the total number of intermediate nodes of the k-th attribute, and m represents the total number of protocol types.

表1中间节点值及其编码值在RAM中的存储方式Table 1 Storage method of intermediate node value and its encoded value in RAM

  中间节点值intermediate node value  中间节点编码值Intermediate node code value  第1个属性第1个中间节点值The value of the first intermediate node of the first attribute  第1个属性第1个中间节点编码值The encoding value of the first intermediate node of the first attribute  第1个属性第2个中间节点值The value of the first attribute and the second intermediate node  第1个属性第2个中间节点编码值The encoding value of the first attribute and the second intermediate node  …...   …...  第1个属性第n1个中间节点值The first attribute is the value of the n1th intermediate node 第1个属性第n1个中间节点编码值The encoding value of the n1th intermediate node of the first attribute  第2个属性第1个中间节点值The value of the first intermediate node of the second attribute  第2个属性第1个中间节点编码值The encoding value of the first intermediate node of the second attribute  第2个属性第2个中间节点值The value of the second intermediate node of the second attribute  第2个属性第2个中间节点编码值The code value of the second intermediate node of the second attribute  …...  …...  第2个属性第n2个中间节点值The second attribute is the value of the n2th intermediate node 第2个属性第n2个中间节点编码值The second attribute is the code value of the n2th intermediate node  …………………………………………  ………………………..………………………..  第k个属性第1个中间节点值The first intermediate node value of the kth attribute  第k个属性第1个中间节点编码值The code value of the first intermediate node of the kth attribute  第k个属性第2个中间节点值The second intermediate node value of the kth attribute  第k个属性第2个中间节点编码值The encoding value of the second intermediate node of the kth attribute  …...  …...  第k个属性第nk个中间节点值The kth attribute is the value of the nkth intermediate node 第k个属性第nk个中间节点编码值The encoding value of the kth attribute and the nkth intermediate node

表2节点值及其编码值在RAM中的存储方式Table 2 The storage method of node value and its encoded value in RAM

  节点值(即类别值)Node value (i.e. category value)  节点编码值Node code value  第1种协议类型值The first protocol type value 第1种协议类型编码值范围The first protocol type encoding value range  第2种协议类型值The second protocol type value 第2种协议类型编码值范围The second protocol type encoding value range  …………………………………………  …………………….…………………….  第m种协议类型值The value of the mth protocol type 第m种协议类型编码值范围Range of encoding values for the mth protocol type

S105:对待分类的数据包进行分流S105: Distribute the data packets to be classified

根据{源地址、目的地址、源端口、目的端口、传输层协议类型}五元组将数据包划分到不同的流并维护流信息表,流信息表用于记录流的五元组信息以及该条流的类别。在此,仅对语义完整的TCP流进行分析。以TCP的3次握手为流的开始,以TCP的FIN=1或RST=1作为流的结束。根据网络中报文的五元组信息{源地址、目的地址、源端口、目的端口、传输层协议类型}判断是否为一条流。若五元组相同,则属于同一个流。否则,为不同的流。其中,若两个包的源地址相同,则属于同向网络流;若源地址与目的地址相同,则属于反向网络流;并约定,以第一个报文的转发方向为该网络流的上行方向。此外,若两个报文间隔超过一定时间,则属于不同的网络流。流信息表中每一条记录包括如下内容:标识一条流的ID、{源地址、目的地址、源端口、目的端口、传输层协议类型}五元组、识别出来的协议类型。流信息表仅需要保存已分类的流的记录,不需要保存未分类的流的记录,因此对流信息表进行查找时,若不存在记录则可以立即判断为未分类,从而节省查找时间。According to the five-tuple of {source address, destination address, source port, destination port, transport layer protocol type}, the data packet is divided into different flows and the flow information table is maintained. The flow information table is used to record the five-tuple information of the flow and the The category of the stream. Here, only semantically complete TCP streams are analyzed. The flow begins with TCP's 3-way handshake, and ends with TCP's FIN=1 or RST=1. According to the five-tuple information {source address, destination address, source port, destination port, transport layer protocol type} of the packet in the network, it is judged whether it is a flow. If the quintuples are the same, they belong to the same stream. Otherwise, a different stream. Among them, if the source addresses of the two packets are the same, they belong to the same network flow; if the source address and the destination address are the same, they belong to the reverse network flow; and it is agreed that the forwarding direction of the first message is the network flow up direction. In addition, if the interval between two packets exceeds a certain period of time, they belong to different network flows. Each record in the flow information table includes the following contents: ID for identifying a flow, {source address, destination address, source port, destination port, transport layer protocol type} five-tuple, and the identified protocol type. The flow information table only needs to save the records of classified flows, not the records of unclassified flows. Therefore, when searching the flow information table, if there is no record, it can be judged as unclassified immediately, thus saving search time.

S106:判断该数据包所属的TCP流是否已分类S106: Determine whether the TCP stream to which the data packet belongs has been classified

利用S105提取的数据包的五元组信息,对流信息表进行查找,看表中是否已存在该五元组代表的流所对应的记录,如果存在记录,则读出该条流的类别值,如果不存在记录,则该条流未被分类。Utilize the quintuple information of the data packet that S105 extracts, search the flow information table, see whether the record corresponding to the flow that the quintuple represents already exists in the table, if there is a record, then read the category value of this flow, If no record exists, the flow is not classified.

S107:对已分类的数据包打上正确标签S107: Labeling the classified data packets correctly

利用步骤S106获取的类别信息对所有经过的数据包进行打标签处理,若数据包所属的流已经被分类,则打上相应的类别标签,分类结束。Use the category information acquired in step S106 to label all passing data packets. If the flow to which the data packet belongs has been classified, label the corresponding category, and the classification ends.

S108:对未分类的数据包打上默认标签并提取待分类TCP流的包特征S108: Put a default label on the unclassified data packet and extract the packet characteristics of the TCP flow to be classified

对于未分类的数据包,按照一定的原则标记一个默认的标签,然后判断该数据包是否需要被提取包特征并做相应处理。在这里,包特征的提取与S102中采用的最终特征序列相对应,即提取第某个包或某些包的某个属性或某些属性是与S102中的最终特征序列相一致的,需要按包到达顺序进行提取,并构建待分类流的特征序列。与流信息表类似,流量信息提取模块也要维护一张参数表,参数表中每一条记录包括如下内容:标识一条流的ID、源地址、目的地址、源端口、目的端口、传输层协议类型}五元组、某个包与前一个包的间隔时间、某个包的包长、某个包的包方向、该条流参数是否已满标志。For unclassified data packets, mark a default label according to certain principles, and then judge whether the data packet needs to be extracted and processed accordingly. Here, the extraction of package features corresponds to the final feature sequence used in S102, that is, the extraction of a certain attribute or certain attributes of the first package or some packages is consistent with the final feature sequence in S102, and needs to be pressed according to Packet arrival sequence is extracted, and the feature sequence of the flow to be classified is constructed. Similar to the flow information table, the traffic information extraction module also maintains a parameter table. Each record in the parameter table includes the following content: ID, source address, destination address, source port, destination port, transport layer protocol type identifying a flow } quintuple, the interval between a certain packet and the previous packet, the packet length of a certain packet, the packet direction of a certain packet, and whether the flow parameter is full or not.

网络数据(即数据包传输过程中的帧)传输是不受影响的,原因在于流量信息提取模块像一颗数据探针,仅仅将路过该模块的参数信息拷贝出来,而不改变任何数据以及数据的传输时序。The transmission of network data (that is, frames in the process of data packet transmission) is not affected, because the traffic information extraction module is like a data probe, which only copies the parameter information passing through the module without changing any data and data transmission timing.

S109,决策树查找。S109, decision tree search.

利用S108所得的待分类流的特征序列对S104所得的两块RAM进行查找,判断该TCP流的类别值并更新流信息表。在查找过程中采用并行处理策略,仅需要两个时钟周期即可完成决策树的查找过程。即第一个时钟周期并行比较所有属性的所有中间节点值,确定该流所属的所有中间节点编码值。也就是说,第一个时钟周期需要完成第1个属性的n1个中间节点值的比较以确定第1个属性所属的中间节点值范围区间,完成第2个属性的n2个中间节点值的比较以确定第2个属性所属的中间节点值范围区间……完成第k个属性的nk个中间节点值的比较以确定第k个属性所属的边缘节点值范围区间,而这n1+n2+…+nk个比较器是同时并行开始执行的。第一个时钟周期结束后,即可确定该流所属的所有属性的中间节点范围,通过RAM中的记录同时可以确定该流所属的所有中间节点编码值,其中一个属性对应一个中间节点编码值,则一条流对应k个中间节点编码值,将这k个中间节点编码值合并为一个数据,第二个时钟周期利用前一个时钟周期的合并结果数据并行比较所有边缘节点编码值,从而确定该流的边缘节点值,也就是协议类别值。Use the feature sequence of the flow to be classified obtained in S108 to search the two blocks of RAM obtained in S104, determine the category value of the TCP flow and update the flow information table. The parallel processing strategy is adopted in the search process, and only two clock cycles are needed to complete the search process of the decision tree. That is, in the first clock cycle, all intermediate node values of all attributes are compared in parallel, and all intermediate node encoding values to which the stream belongs are determined. That is to say, the first clock cycle needs to complete the comparison of n1 intermediate node values of the first attribute to determine the range of intermediate node values to which the first attribute belongs, and complete the comparison of n2 intermediate node values of the second attribute To determine the value range interval of the intermediate node to which the second attribute belongs...Complete the comparison of nk intermediate node values of the kth attribute to determine the value range interval of the edge node to which the kth attribute belongs, and this n1+n2+...+nk Comparators are executed in parallel at the same time. After the end of the first clock cycle, the intermediate node ranges of all the attributes to which the stream belongs can be determined, and the code values of all intermediate nodes to which the stream belongs can be determined at the same time through the records in RAM, one of which corresponds to an intermediate node code value, Then a flow corresponds to k intermediate node encoding values, and the k intermediate node encoding values are combined into one data, and the second clock cycle uses the merged result data of the previous clock cycle to compare all edge node encoding values in parallel to determine the stream The edge node value of , that is, the protocol category value.

图2为本发明所提供的流量分类装置的结构示意图。Fig. 2 is a schematic structural diagram of the flow classification device provided by the present invention.

从功能上看,该流量分类装置可以分为在线和离线两个部分。离线部分主要完成决策树的构造及数据结构转换;在线部分主要负责未知数据流的分类。离线部分包括顺序连接的前期数据流量采集模块201、前期数据流分流模块202、前期数据流人工分类模块203、前期数据流特征提取模块204、决策树建树模块205、决策树结构转换模块206以及后期的分类结果处理模块207;在线部分包括顺序连接的MAC层处理模块一211、数据包轮询管理模块212、分流判断模块213、流量信息提取及打标签模块214、决策树查找模块215、MAC层处理模块二216。From a functional point of view, the flow classification device can be divided into two parts: online and offline. The offline part mainly completes the construction of the decision tree and the data structure conversion; the online part is mainly responsible for the classification of unknown data streams. The offline part includes sequentially connected pre-data flow collection module 201, pre-data stream shunting module 202, pre-data stream manual classification module 203, pre-data stream feature extraction module 204, decision tree building module 205, decision tree structure conversion module 206 and later classification result processing module 207; the online part includes sequentially connected MAC layer processing module one 211, data packet polling management module 212, distribution judgment module 213, traffic information extraction and labeling module 214, decision tree search module 215, MAC layer Process Module II 216.

在本流量分类装置中,前期数据流量采集模块201、前期数据流分流模块202、前期数据流人工分类模块203、前期数据流特征提取模块204、决策树建树模块205、决策树结构转换模块206可在装置部署前完成,因此不是使用流量分类的装置或者系统的必要组成部分。而MAC层处理模块一211、数据包轮询管理模块212、分流判断模块213、流量信息提取及打标签模块214、决策树查找模块215、MAC层处理模块二216、分类结果处理模块207一般应在使用流量分类的装置或系统中出现。In this traffic classification device, the pre-data flow collection module 201, the pre-data stream distribution module 202, the pre-data stream manual classification module 203, the pre-data stream feature extraction module 204, the decision tree building module 205, and the decision tree structure conversion module 206 can be It is done before the device is deployed and therefore is not a required part of the device or system that uses traffic classification. And MAC layer processing module one 211, data packet polling management module 212, distribution judgment module 213, traffic information extraction and labeling module 214, decision tree search module 215, MAC layer processing module two 216, classification result processing module 207 generally should Occurs in devices or systems that use traffic classification.

每一模块具体功能和处理流程如下:在带有流量分类的装置或系统使用前,需要使用前期数据流量采集模块201、前期数据流分流模块202、前期数据流人工分类模块203、、前期数据流特征提取模块204、决策树建树模块205、决策树结构转换模块206完成图1中S101~S104的工作,形成的经过转换的决策树数据结构置于装置中的RAM中。当一个未知类别的数据包进入流量分类装置后,而MAC层处理模块一211、数据包轮询管理模块212对数据包进行预处理,分流判断模块213根据{源地址、目的地址、源端口、目的端口、传输层协议类型}五元组将数据包划分到不同的流并维护流信息表,然后对流信息表进行查找以确定数据包所属的类别,完成图1中S105~S106的工作。流量信息提取及打标签模块214根据分流判断模块213所获取的类别信息对数据包进行打标签处理,同时按数据包先后顺序,依次提取包长、修正包间隔时间、传送方向等参数,形成特征序列,送入决策树查找模块215、完成图1中S107~S109的工作。流信息表仅由决策树查找模块215进行更新处理,其他模块均不能对流信息表进行写操作。MAC层处理模块二216及分类结果处理模块207对数据包进行后续的处理并显示分类结果。The specific functions and processing flow of each module are as follows: Before the device or system with traffic classification is used, it is necessary to use the early data flow collection module 201, the early data flow distribution module 202, the early data flow manual classification module 203, and the early data flow The feature extraction module 204, the decision tree building module 205, and the decision tree structure conversion module 206 complete the work of S101-S104 in FIG. 1, and the formed converted decision tree data structure is placed in the RAM of the device. After a data packet of an unknown category enters the traffic classification device, the MAC layer processing module one 211 and the data packet polling management module 212 preprocess the data packet, and the distribution judging module 213 is based on {source address, destination address, source port, Destination port, transport layer protocol type} quintuple divides the data packet into different flows and maintains the flow information table, and then searches the flow information table to determine the category to which the data packet belongs, and completes the work of S105-S106 in Figure 1. The traffic information extraction and labeling module 214 performs labeling processing on the data packets according to the type information obtained by the distribution judging module 213, and simultaneously extracts parameters such as packet length, corrected packet interval time, and transmission direction according to the sequence of the data packets to form a feature The sequence is sent to the decision tree search module 215 to complete the work of S107-S109 in FIG. 1 . The flow information table is only updated by the decision tree search module 215, and other modules cannot write to the flow information table. The second MAC layer processing module 216 and the classification result processing module 207 perform subsequent processing on the data packet and display the classification result.

本实施例提供的方法和装置,对采用C4.5算法建立的决策树进行了数据结构转换,使之转换成一种易于硬件实现的数据结构,降低了算法本身的复杂度;决策树查找过程中使用了并行查找和流水线技术,提高了处理速度;选取的包特征提取过程简单,易于在线完成;利用了C4.5算法本身所具有的准确度高、稳定性好的特点。因此,本实施例可以方便的实现网络流量的高速在线分类。The method and device provided by this embodiment convert the data structure of the decision tree established by the C4.5 algorithm, so that it is converted into a data structure that is easy to implement in hardware, reducing the complexity of the algorithm itself; Using parallel search and pipeline technology, the processing speed is improved; the feature extraction process of the selected package is simple and easy to complete online; the C4.5 algorithm itself has been used for its high accuracy and good stability. Therefore, this embodiment can conveniently implement high-speed online classification of network traffic.

最后应说明的是:以上实施例仅用以说明本发明的技术方案及装置,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解;其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions and devices of the present invention, rather than to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand; It is still possible to modify the technical solutions described in the foregoing embodiments, or to perform equivalent replacements on the technical features; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the various embodiments of the present invention. .

Claims (9)

Translated fromChinese
1.一种基于决策树高速并行处理策略实现TCP流量在线分类的方法,其特征在于包括以下步骤:1. a kind of method based on decision tree high-speed parallel processing strategy realizes TCP traffic online classification, it is characterized in that comprising the following steps:步骤1,前期真实流量数据的采集、分流及手工分类:采集网络真实流量数据集,利用五元组将数据集分离为不同的TCP流,对TCP流的集合进行手工分类,使每一条TCP流都与一种协议类型相对应;Step 1, the collection, distribution and manual classification of real traffic data in the early stage: collect the real traffic data set of the network, use quintuple to separate the data set into different TCP streams, and manually classify the collection of TCP streams, so that each TCP stream Both correspond to a protocol type;步骤2,提取前期TCP流集合的若干个包特征:提取每一条TCP流中关于数据包的特征,并按照数据包在该TCP流的先后顺序构建初步特征序列,然后对包特征进行筛选,得到最终特征序列;Step 2, extract several packet features of the previous TCP flow set: extract the characteristics of the data packets in each TCP flow, and construct a preliminary feature sequence according to the order of the data packets in the TCP flow, and then filter the packet characteristics to obtain final feature sequence;步骤3,C4.5决策树分类模型的建立:对步骤2构成的最终特征序列,利用C4.5决策树算法进行建树;Step 3, establishment of C4.5 decision tree classification model: use C4.5 decision tree algorithm to construct the final feature sequence formed in step 2;步骤4,对步骤3中建立的决策树进行数据结构转换并存储到硬件设备的存储设备中:通过对决策树的遍历,一方面提取决策树的分支节点值,对同一属性的各分支节点值进行从小到大的排序,然后对所有属性的各个分支节点值按顺序进行从小到大的编码,另一方面提取决策树的叶子节点值,对叶子节点值同样也进行编码,叶子节点值的编码是一个范围,取决于到达该叶子节点所经历的各分支节点的编码值;分支节点值及其编码以及叶子节点值及其编码分别存储在两块分离的存储设备中;Step 4, convert the data structure of the decision tree established in step 3 and store it in the storage device of the hardware device: by traversing the decision tree, on the one hand, extract the branch node values of the decision tree, and for each branch node value of the same attribute Sorting from small to large, and then coding each branch node value of all attributes from small to large in order, on the other hand, extracting the leaf node value of the decision tree, and encoding the leaf node value, the coding of the leaf node value It is a range, which depends on the coding values of each branch node experienced to reach the leaf node; the branch node value and its coding, and the leaf node value and its coding are stored in two separate storage devices;步骤5,对待分类的数据包进行分流及类别判断:根据五元组将数据包划分到不同的流并查找流信息表获取分类信息,流信息表用于记录流的五元组信息以及该条流的类别;Step 5, divide the data packets to be classified and judge the category: divide the data packets into different flows according to the five-tuple and search the flow information table to obtain the classification information. The flow information table is used to record the five-tuple information of the flow and the class of flow;步骤6,对当前数据包进行打标签处理并提取待分类TCP流的包特征:利用步骤5提取的类别信息对所有经过的数据包进行打标签处理,若数据包所属的流已经被分类,则打上相应的类别标签,若未分类,则按照一定的原则标记一个默认的标签,然后判断该数据包是否需要被提取包特征并做相应处理;在这里,包特征的提取与步骤2中采用的最终特征序列相对应,需要按包到达顺序进行提取,并构建待分类流的特征序列,待分类流的特征序列存储在参数表中,参数表的一条记录包括五元组、各个包特征值以及参数是否满的标志;Step 6: Tag the current data packet and extract the packet characteristics of the TCP flow to be classified: use the category information extracted in step 5 to tag all passing data packets, if the flow to which the data packet belongs has been classified, then Put the corresponding category label, if it is not classified, then mark a default label according to certain principles, and then judge whether the data packet needs to be extracted and processed accordingly; here, the extraction of the package feature is the same as that used in step 2 Corresponding to the final feature sequence, it needs to be extracted according to the order of packet arrival, and the feature sequence of the flow to be classified is constructed. The feature sequence of the flow to be classified is stored in the parameter table. A record in the parameter table includes five tuples, each packet feature value, and Whether the parameter is full or not;步骤7,决策树查找:利用步骤6所得的待分类流的特征序列对步骤3所得的两块存储设备进行查找,判断该TCP流的类别值并更新流信息表;该步骤采用并行查找的方式及流水线的结构以提高查找速度,在不考虑其他读写信号及时钟同步处理的情况下,仅需要两个时钟周期即可完成决策树的查找过程,即第一个时钟周期并行比较所有属性的所有分支节点值,确定该流所属的所有分支节点编码值并合并为一个数据,第二个时钟周期利用前一个时钟周期的结果数据并行比较所有叶子节点的编码值,从而确定该流的类别。Step 7, decision tree search: use the characteristic sequence of the flow to be classified obtained in step 6 to search the two storage devices obtained in step 3, judge the category value of the TCP flow and update the flow information table; this step adopts the method of parallel search And the structure of the pipeline to improve the search speed. Without considering other read and write signals and clock synchronization processing, only two clock cycles are needed to complete the search process of the decision tree, that is, the first clock cycle compares all attributes in parallel. For all branch node values, determine the encoding values of all branch nodes to which the flow belongs and combine them into one data. In the second clock cycle, the result data of the previous clock cycle is used to compare the encoding values of all leaf nodes in parallel to determine the category of the flow.2.根据权利要求1所述的基于决策树高速并行处理策略实现TCP流量在线分类的方法,其特征在于:2. the method for realizing TCP traffic online classification based on decision tree high-speed parallel processing strategy according to claim 1, is characterized in that:其中C4.5决策树分类模型的建立,是基于步骤1获得的采集网络真实流量数据集和步骤2获得的包特征,通过离线处理建立C4.5决策树并对决策树进行数据结构转换来完成的。Among them, the establishment of the C4.5 decision tree classification model is based on the collected network real traffic data set obtained in step 1 and the packet characteristics obtained in step 2. The C4.5 decision tree is established through offline processing and the data structure conversion of the decision tree is completed. of.3.根据权利要求1所述的基于决策树高速并行处理策略实现TCP流量在线分类的方法,其特征在于:3. the method for realizing TCP flow online classification based on decision tree high-speed parallel processing strategy according to claim 1, is characterized in that:所述步骤2中,需根据特征选择算法对初步提取的若干个包特征进行处理,筛选出最能体现流类别特性的包特征。In the step 2, the initially extracted packet features need to be processed according to the feature selection algorithm, and the packet features that can best reflect the characteristics of the flow category are screened out.4.根据权利要求1所述的基于决策树高速并行处理策略实现TCP流量在线分类的方法,其特征在于:4. the method for realizing TCP traffic online classification based on decision tree high-speed parallel processing strategy according to claim 1, is characterized in that:所述步骤5中,流信息表仅需要保存已分类的流的记录,不需要保存未分类的流的记录,因此对流信息表进行查找时,若不存在记录则可以立即判断为未分类,从而节省查找时间。In the step 5, the flow information table only needs to save the records of the classified flows, and does not need to save the records of the unclassified flows. Therefore, when searching the flow information table, if there is no record, it can be judged as unclassified immediately, thus Save search time.5.根据权利要求1所述的基于决策树高速并行处理策略实现TCP流量在线分类的方法,其特征在于:5. the method for realizing TCP traffic online classification based on decision tree high-speed parallel processing strategy according to claim 1, is characterized in that:所述步骤6中,将包按照包到达顺序进行排列,取三次握手的第一个请求包即Setup包作为该流的第一个包。In the step 6, the packets are arranged according to the arrival order of the packets, and the first request packet of the three-way handshake, that is, the Setup packet is taken as the first packet of the flow.6.根据权利要求1所述的基于决策树高速并行处理策略实现TCP流量在线分类的方法,其特征在于:6. the method for realizing TCP traffic online classification based on decision tree high-speed parallel processing strategy according to claim 1, is characterized in that:所述步骤6中,已分类被打上正确标签的包与未分类打上默认标签且需要进行参数提取的包需要进行时钟同步处理,即插入相应级别的流水线以保证数据包传输路线上的FIFO不会出现溢出现象。In the step 6, the packets that have been classified and labeled with the correct label and the packets that have not been classified and labeled with the default label and need to be extracted need to be clock-synchronized, that is, the corresponding level of pipeline is inserted to ensure that the FIFO on the data packet transmission route will not Overflow occurs.7.一种TCP流量在线分类装置,用于实现如权利要求1所述的基于决策树高速并行处理策略实现TCP流量在线分类的方法,该装置包括在线部分和离线部分,其中离线部分具有C4.5决策树建树模块、C4.5决策树结构转换模块、分类结果处理模块;在线部分包括顺序连接的MAC层处理模块一、数据包轮询管理模块、分流判断模块、流量信息提取及打标签模块、C4.5决策树查找模块、MAC层处理模块二;7. A TCP flow online classification device, for realizing the method for realizing the TCP flow online classification based on decision tree high-speed parallel processing strategy as claimed in claim 1, the device includes an online part and an offline part, wherein the offline part has a C4. 5 Decision tree building module, C4.5 decision tree structure conversion module, classification result processing module; the online part includes sequentially connected MAC layer processing module 1, data packet polling management module, distribution judgment module, traffic information extraction and labeling module , C4.5 decision tree search module, MAC layer processing module two;C4.5决策树建树模块,用于根据前期真实网络数据流量建立决策树模型;C4.5 Decision tree building module, used to build a decision tree model based on the real network data traffic in the previous period;C4.5决策树结构转换模块,用于将决策树模型的结构进行转换,使之成为易于硬件实现的另一种数据结构并存储到硬件设备的存储设备中:通过对决策树的遍历,一方面提取决策树的分支节点值,对同一属性的各分支节点值进行从小到大的排序,然后对所有属性的各个分支节点值按顺序进行从小到大的编码,另一方面提取决策树的叶子节点值,对叶子节点值同样也进行编码,叶子节点值的编码是一个范围,取决于到达该叶子节点所经历的各分支节点的编码值;分支节点值及其编码以及叶子节点值及其编码分别存储在两块分离的存储设备中;C4.5 The decision tree structure conversion module is used to convert the structure of the decision tree model to make it another data structure that is easy to realize by hardware and store it in the storage device of the hardware device: by traversing the decision tree, a On the one hand, extract the branch node values of the decision tree, sort the branch node values of the same attribute from small to large, and then encode the values of each branch node of all attributes in order from small to large; on the other hand, extract the leaves of the decision tree Node value, the leaf node value is also encoded, the encoding of the leaf node value is a range, depending on the encoding value of each branch node that reaches the leaf node; the branch node value and its encoding, and the leaf node value and its encoding respectively stored in two separate storage devices;MAC层处理模块如今有许多现成的IP核可以采用;The MAC layer processing module now has many ready-made IP cores that can be used;数据包轮询管理模块,用于从N个数据包缓冲队列中读取数据包,此处采用轮询式访问每个输入队列,直到从一个队列读完一个完整的数据包才转到下一个队列;The data packet polling management module is used to read data packets from N data packet buffer queues. Here, polling is used to access each input queue until a complete data packet is read from a queue before going to the next one. queue;分流判断模块,用于根据五元组将数据包划分到不同的流,判断该条流是否被分类,如果被分类类别值是多少,并维护流信息表,流信息表用于记录流的五元组信息以及该条流的类别;The distribution judging module is used to divide the data packet into different streams according to the five-tuple, judge whether the stream is classified, and if it is classified, what is the value of the category, and maintain the stream information table, which is used to record the five streams of the stream Tuple information and the category of the stream;流量信息提取及打标签模块,用于对未分类的数据包进行流特征提取并对所有数据包进行打标签处理;该模块需要维护一张参数表,参数表记录各条流的五元组信息、特征值以及特征值是否满的标志;特征值满的流的参数信息被发送到C4.5决策树查找模块;The flow information extraction and labeling module is used to extract flow characteristics of unclassified data packets and label all data packets; this module needs to maintain a parameter table, which records the five-tuple information of each flow , eigenvalue, and the sign of whether the eigenvalue is full; the parameter information of the flow with full eigenvalue is sent to the C4.5 decision tree search module;C4.5决策树查找模块,用于利用流量信息提取及打标签模块发送的待分类流的特征序列对决策树结构转换模块所得的两块存储设备进行查找,判断该TCP流的类别值并更新流信息表;在查找过程中采用并行处理策略,仅需要两个时钟周期即可完成决策树的查找过程,即第一个时钟周期并行比较所有属性的所有分支节点值,确定该流所属的所有分支节点编码值并合并为一个数据,第二个时钟周期利用前一个时钟周期的结果数据并行比较所有叶子节点的编码值,从而确定该流的类别,并将分类结果发送至流信息表以对流信息表中的记录进行更新;C4.5 The decision tree search module is used to search the two storage devices obtained by the decision tree structure conversion module by using the feature sequence of the stream to be classified sent by the traffic information extraction and labeling module, and judge the category value of the TCP stream and update it Flow information table; the parallel processing strategy is used in the search process, and only two clock cycles are needed to complete the search process of the decision tree, that is, the first clock cycle compares all branch node values of all attributes in parallel to determine all The coded values of the branch nodes are merged into one data, and the second clock cycle uses the result data of the previous clock cycle to compare the coded values of all leaf nodes in parallel to determine the category of the flow, and send the classification result to the flow information table for flow analysis. The records in the information table are updated;分类结果处理模块,用于对流量分类的结果进行汇总、处理及界面显示。The classification result processing module is used to summarize, process and display the traffic classification results.8.根据权利要求7所述的TCP流量在线分类装置,其特征在于:8. The TCP flow online classification device according to claim 7, characterized in that:离线部分的C4.5决策树结构转换模块与在线部分的C4.5决策树查找模块相连接,离线部分的分类结果处理模块与在线部分的C4.5决策树查找模块通过一个流信息表间接连接。The C4.5 decision tree structure conversion module in the offline part is connected with the C4.5 decision tree search module in the online part, and the classification result processing module in the offline part is indirectly connected with the C4.5 decision tree search module in the online part through a flow information table .9.根据权利要求7所述的TCP流量在线分类装置,其特征在于:9. The TCP flow online classification device according to claim 7, characterized in that:所述离线部分还具有前期数据流量采集模块、前期数据流分流模块、前期数据流人工分类模块、前期数据流特征提取模块。The offline part also has a pre-data flow collection module, a pre-data stream distribution module, a pre-data stream manual classification module, and a pre-data stream feature extraction module.
CN201210006268.7A2012-01-092012-01-09 Method and device for online classification of network traffic based on high-speed parallel processing of decision treesExpired - Fee RelatedCN102523241B (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN201210006268.7ACN102523241B (en)2012-01-092012-01-09 Method and device for online classification of network traffic based on high-speed parallel processing of decision trees

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN201210006268.7ACN102523241B (en)2012-01-092012-01-09 Method and device for online classification of network traffic based on high-speed parallel processing of decision trees

Publications (2)

Publication NumberPublication Date
CN102523241A CN102523241A (en)2012-06-27
CN102523241Btrue CN102523241B (en)2014-11-19

Family

ID=46294033

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN201210006268.7AExpired - Fee RelatedCN102523241B (en)2012-01-092012-01-09 Method and device for online classification of network traffic based on high-speed parallel processing of decision trees

Country Status (1)

CountryLink
CN (1)CN102523241B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US9444730B1 (en)2015-11-112016-09-13International Business Machines CorporationNetwork traffic classification

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
CN102904890A (en)*2012-10-122013-01-30哈尔滨工业大学深圳研究生院 Cloud data packet header state detection method
CN103209169B (en)*2013-02-232016-03-09北京工业大学A kind of network traffics filtration system based on FPGA and method
CN104125106A (en)*2013-04-232014-10-29中国银联股份有限公司Network purity detection device and method based on classified decision tree
EP3025522B1 (en)*2013-12-132019-10-23Telefonaktiebolaget LM Ericsson (publ)Traffic coordination for communication sessions involving wireless terminals and server devices
EP3257235B1 (en)*2015-02-102020-05-13Telefonaktiebolaget LM Ericsson (publ)A method and apparatus for data mediation
CN105162663B (en)*2015-09-252019-02-19中国人民解放军信息工程大学A kind of online method for recognizing flux based on adfluxion
CN106408007A (en)*2016-09-072017-02-15国家电网公司Power communication network flow classification method and system
CN106572486B (en)*2016-10-172020-11-27湖北大学 A method and system for identifying traffic of handheld terminal based on machine learning
CN106975617B (en)*2017-04-122018-10-23北京理工大学A kind of Classification of materials method based on color selector
CN107391912A (en)*2017-07-042017-11-24大连大学 Hospital clinical operation data selection method based on big and small flow classification applied in cloud data center system
CN108304164B (en)*2017-09-122021-12-03马上消费金融股份有限公司Business logic development method and development system
CN108229573B (en)*2018-01-172021-05-25北京中星微人工智能芯片技术有限公司 Classification calculation method and device based on decision tree
CN108921600B (en)*2018-06-202024-10-22京东科技控股股份有限公司Apparatus and method for implementing information classification and storage medium
CN109086815B (en)*2018-07-242021-08-31中国人民解放军国防科技大学 Floating point discretization method in decision tree model based on FPGA
CN109063777B (en)*2018-08-072019-12-03北京邮电大学Net flow assorted method, apparatus and realization device
CN109246095B (en)*2018-08-292019-06-21四川大学 A communication data encoding method suitable for deep learning
CN109784370B (en)*2018-12-142024-05-10中国平安财产保险股份有限公司Decision tree-based data map generation method and device and computer equipment
CN110445689B (en)*2019-08-152022-03-18平安科技(深圳)有限公司Method and device for identifying type of equipment of Internet of things and computer equipment
CN112887300B (en)*2021-01-222022-02-01北京交通大学Data packet classification method
CN113240036B (en)*2021-05-282023-11-07北京达佳互联信息技术有限公司Object classification method and device, electronic equipment and storage medium
CN113360740B (en)*2021-06-042022-10-11上海天旦网络科技发展有限公司Data packet labeling method and system
CN113810311B (en)*2021-09-142024-11-26北京左江科技股份有限公司 A packet classification method based on multiple decision trees
CN114660939B (en)*2022-03-292024-07-16北京百度网讯科技有限公司Object control method and device, electronic equipment and storage medium
CN114900474B (en)*2022-05-052023-08-22鹏城实验室Data packet classification method, system and related equipment for programmable switch
CN115914141A (en)*2022-09-232023-04-04暨南大学P4 hardware switch-based network data flow classification prediction method
US12388729B2 (en)*2023-01-052025-08-12Samsung Electronics Co., Ltd.Methods and apparatus for detecting network services
CN116226893B (en)*2023-05-092023-08-01北京明苑风华文化传媒有限公司Client marketing information management system based on Internet of things
CN116521963A (en)*2023-07-042023-08-01北京智麟科技有限公司Method and system for processing calculation engine data based on componentization

Citations (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US5870735A (en)*1996-05-011999-02-09International Business Machines CorporationMethod and system for generating a decision-tree classifier in parallel in a multi-processor system
CN101184097A (en)*2007-12-142008-05-21北京大学 A Method of Detecting Worm Activities Based on Flow Information
CN102214213A (en)*2011-05-312011-10-12中国科学院计算技术研究所Method and system for classifying data by adopting decision tree
CN102271090A (en)*2011-09-062011-12-07电子科技大学 Traffic classification method and device based on transport layer characteristics

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US7039762B2 (en)*2003-05-122006-05-02International Business Machines CorporationParallel cache interleave accesses with address-sliced directories
US8290882B2 (en)*2008-10-092012-10-16Microsoft CorporationEvaluating decision trees on a GPU

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US5870735A (en)*1996-05-011999-02-09International Business Machines CorporationMethod and system for generating a decision-tree classifier in parallel in a multi-processor system
CN101184097A (en)*2007-12-142008-05-21北京大学 A Method of Detecting Worm Activities Based on Flow Information
CN102214213A (en)*2011-05-312011-10-12中国科学院计算技术研究所Method and system for classifying data by adopting decision tree
CN102271090A (en)*2011-09-062011-12-07电子科技大学 Traffic classification method and device based on transport layer characteristics

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers,Inc.,1993;Steven L.Salzberg;《machine learning》;19940930;第16卷(第3期);第235-240页*
fast traffic classification in high speed networks;Rentao Gu 等;《challenges for next generation network operation and service management》;20081231;第429-432页*
Rentao Gu 等.fast traffic classification in high speed networks.《challenges for next generation network operation and service management》.2008,第429-432页.*
Steven L.Salzberg.C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers,Inc.,1993.《machine learning》.1994,第16卷(第3期),第235-240页.*
数据挖掘中决策树算法的最新进展;韩慧 等;《计算机应用研究》;20041231;第21卷(第12期);第5-8页*
数据挖掘的并行策略研究;颜雪松 等;《计算机工程与应用》;20030131;第39卷(第3期);第187-189页*
韩慧 等.数据挖掘中决策树算法的最新进展.《计算机应用研究》.2004,第21卷(第12期),*
颜雪松 等.数据挖掘的并行策略研究.《计算机工程与应用》.2003,第39卷(第3期),*

Cited By (4)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US9444730B1 (en)2015-11-112016-09-13International Business Machines CorporationNetwork traffic classification
US9596171B1 (en)2015-11-112017-03-14International Business Machines CorporationNetwork traffic classification
US9882807B2 (en)2015-11-112018-01-30International Business Machines CorporationNetwork traffic classification
US9942135B2 (en)2015-11-112018-04-10International Business Machines CorporationNetwork traffic classification

Also Published As

Publication numberPublication date
CN102523241A (en)2012-06-27

Similar Documents

PublicationPublication DateTitle
CN102523241B (en) Method and device for online classification of network traffic based on high-speed parallel processing of decision trees
CN113079069B (en)Mixed granularity training and classifying method for large-scale encrypted network traffic
Song et al.Encrypted traffic classification based on text convolution neural networks
CN106961445A (en)Message parsing method and its device based on FPGA hardware parallel pipeline
CN103297427A (en)Unknown network protocol identification method and system
CN101252541A (en) A method for establishing a network traffic classification model and a corresponding system
CN109167680A (en)A kind of traffic classification method based on deep learning
WO2015154484A1 (en)Traffic data classification method and device
CN101605126A (en) A method and system for classification and identification of multi-protocol data
CN108737290A (en)Non-encrypted method for recognizing flux based on load mapping and random forest
CN110034966B (en) A machine learning-based data stream classification method and system
CN103475537A (en)Method and device for message feature extraction
CN108846275A (en)Unknown Method of Detecting Operating System based on RIPPER algorithm
CN115600128A (en)Semi-supervised encrypted traffic classification method and device and storage medium
Yang et al.Deep learning-based reverse method of binary protocol
CN107943826A (en)A kind of high-speed data-flow sorter and method suitable for multiclass field
CN109376797A (en) A network traffic classification method based on binary encoder and multi-hash table
CN118041567A (en)Small sample malicious flow detection method based on multi-flow time sequence characteristics
Liu et al.Dynamic traffic classification algorithm and simulation of energy Internet of things based on machine learning
CN113382039A (en)Application identification method and system based on 5G mobile network flow analysis
CN105812280A (en)Classification method and electronic equipment
CN111767695B (en)Method for optimizing field boundary reasoning in protocol reverse engineering
CN101938420B (en)Cluster topological collection method and device
CN114666282B (en)Machine learning-based 5G flow identification method and device
CN115396363B (en) A traffic classification method and system in an SDN network environment

Legal Events

DateCodeTitleDescription
C06Publication
PB01Publication
C10Entry into substantive examination
SE01Entry into force of request for substantive examination
C14Grant of patent or utility model
GR01Patent grant
CF01Termination of patent right due to non-payment of annual fee
CF01Termination of patent right due to non-payment of annual fee

Granted publication date:20141119


[8]ページ先頭

©2009-2025 Movatter.jp