200926674

IX. Description of the Invention:

[Technical Field of the Invention]

The present invention relates to a method for classifying network traffic, and in particular to a method for identifying the application to which network traffic belongs.

[Prior Art]

In network traffic analysis, one approach determines the type of traffic without inspecting packet content. It uses the arrival times and size variations of the packets, together with preset settings, to decide what type the traffic belongs to. However, this approach can only tell whether the traffic is an Instant Messaging (IM) transmission, the command data transmission of some application, or the data transmission of some application. It cannot identify which application the traffic belongs to.

Traditional network detection techniques all rely on matching the well-known port numbers of applications and the signatures of packet content. This approach has two known drawbacks: (1) it cannot detect applications that choose their port numbers dynamically, and (2) if the packet content is encrypted by the application, it cannot be identified by content signature matching.

Another approach works on the P2P flow pattern. It first checks whether TCP and UDP connections exist at the same time between two hosts, and then removes the connections of some well-known applications (e.g., HTTP, SMTP, FTP). If the number of connections between the two hosts equals the number of port pairs, these connections are treated as P2P traffic. The limitation of this approach is that much of today's P2P traffic runs on well-known ports, so eliminating well-known ports cannot guarantee the detection accuracy for P2P applications.

Furthermore, a method called BLINC analyzes traffic at three levels (Social, Functional, Application). The Social level marks which destinations each source communicates with. If a group of sources communicates at the same time with many of the same destinations, the traffic is very likely virus attack traffic (e.g., Blaster); if the number of destinations is normal, it is likely that a group of users is browsing the same website or using a streaming application. The Functional level determines whether the role of a host leans toward server, client, or P2P. The Application level uses the 4-tuple of source IP, destination IP, source port, and destination port to further distinguish which application the traffic belongs to. Although this method is highly accurate, it is quite time-consuming because it needs a large amount of transport-layer information for its analysis.
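U.S. Patent 6,157,955 proposes an architecture that adds a classification engine to a network interface in order to analyze and classify packets on the network. The classification engine has two main parts: packet header parsing and hash table lookups. The filtering mechanism of the engine is defined by the host, which decides what kinds of application packets may pass. This patent provides a flexible architecture in which new policies can be added at will, and decisions can be made dynamically so as to minimize the packet content information that must be inspected. The patent resembles the architecture of the present application (it uses a classification mechanism to classify traffic in the network), but it provides no further detection mechanism for applications that use encrypted protocols.

U.S. Patent 6,597 proposes an architecture that can analyze, predict, and classify real-time network traffic. It includes a component that stores and processes packet timing information, gathers statistics on the packets received at different points in time and over different time ranges, and then classifies packets using the resulting timing statistics. This patent also uses statistically derived information as the basis for judging packets. However, what it uses is packet arrival time information, which differs from the present application. In addition, this patent cannot further detect encrypted packets.

U.S. Patent 6,754,662 uses a dispatch engine and a set of hash tables to classify packets. The hash tables store a set of classification identifiers. After receiving a packet, the dispatch engine computes the hash value of the packet from the received packet information and then uses that hash value as an index to search the stored hash tables. How long an entry stays in the hash table is decided from network traffic statistics, including access count, most recent access time, application type, and flow length.

U.S. Patent 6,839,751 uses a packet acquisition device and a database. The database mainly stores information about conversational flows that have already been processed. After receiving a packet, the packet acquisition device queries the database to see whether the flow has already been processed. If it has, the database is updated with statistical information, including the packet arrival time and the time difference between this packet and the previous one, among others. If it has not, a new entry is added to the database.

Traditional traffic analysis techniques identify applications by their well-known ports, but much of the harmful traffic now present on networks uses dynamic ports and therefore cannot be identified in this way.

The widely used method of matching packet content signatures cannot inspect packets that have been encrypted, which leaves a gap in management.

Moreover, some malware disguises its packet content to evade content-matching detection, so traditional methods may block legitimate traffic by mistake or let unwanted traffic through. Current packet content inspection also raises personal privacy concerns.

Current traditional detection methods therefore have shortcomings in their ability to judge traffic correctly, and their detection time is long, so they cannot be deployed on gateways or firewalls that must decide network traffic management policy quickly.

[Summary of the Invention]

To solve the above problems, one object of the present invention is to provide a way to detect applications, including those that deliberately hide their communication protocol, so as to provide sufficient information for network management.

Another object of the present invention is to provide a method for classifying the application to which network traffic belongs. The method uses transport-layer behavioral characteristics: it computes the packet size distribution of an application's connections and combines it with port association properties as the basis for identifying applications in traffic. Feature values (vector values) computed from the transport-layer behavior of an application are compared with known representative feature values, and port association is used to identify the other connections related to that application as well.

To achieve the above objects, a method for classifying the application to which network traffic belongs according to one embodiment of the present invention includes: computing a plurality of representative feature values for a specified application; splitting actual network packet traffic into a second set of connections; searching for whether the second set of connections exists in a port association table; and, if it does not exist in the port association table, computing the feature values of the second set of connections, comparing them with the representative feature values, and selecting the closest representative feature value so as to attribute the connections to the specified application.

[Embodiment]

Fig. 1 shows the steps of a network traffic classification method according to one embodiment of the present invention, comprising the training process of a first stage 100 and the classification process of a second stage 200.

In the training process of the first stage 100, recorded traffic is analyzed and classified by application in order to obtain the representative features of each class. The process includes the following steps. Step 110, traffic collection: the flow begins by collecting the traffic of the application to be matched until enough packets have been obtained (more than 400 packets are required). Step 120, connection characterizing: the traffic is split into multiple connections. Step 130, computing the application representative feature values: with each connection as the processing unit, its feature values are computed, comprising the Dominating Size (DS), the Dominating Size Proportion (DSP), and the Change Cycle (CC). Finally, step 140, the set of application representative feature values: a set of application representative feature values is obtained, and the values computed in the preceding steps are stored to serve as the baseline against which traffic is matched online in the classification process of the second stage 200.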
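As a rough illustration of the training stage (steps 120 to 140), the sketch below groups recorded packets into connections by 4-tuple, derives a small feature vector per connection, and averages the vectors into one representative per application. It rests on assumptions that go beyond the text: the dominating sizes are taken to be the `k` most frequent packet sizes of a connection, the proportions are their shares of the packet count, and the names `Packet`, `split_into_connections`, `connection_features`, `representative` and the default `k = 2` are hypothetical. The Change Cycle (CC) is left out because the text does not spell out how it is computed.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    size: int  # packet size in bytes


def split_into_connections(packets):
    """Group packet sizes by (src_ip, src_port, dst_ip, dst_port)."""
    connections = defaultdict(list)
    for p in packets:
        connections[(p.src_ip, p.src_port, p.dst_ip, p.dst_port)].append(p.size)
    return dict(connections)


def connection_features(sizes, k=2):
    """Return [DS_1, DSP_1, ..., DS_k, DSP_k] for one connection.

    Assumption: DS_i is the i-th most frequent packet size and DSP_i is
    its share of all packets in the connection. Unused slots are zero.
    """
    counts = Counter(sizes)
    # Descending count, then ascending size, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]
    features = []
    for size, count in ranked:
        features += [float(size), count / len(sizes)]
    features += [0.0] * (2 * k - len(features))
    return features


def representative(feature_vectors):
    """Average the per-connection vectors component by component."""
    n = len(feature_vectors)
    return [sum(column) / n for column in zip(*feature_vectors)]


# Two recorded connections of one (hypothetical) application.
demo = [
    Packet("10.0.0.1", 5000, "10.0.0.2", 80, 100),
    Packet("10.0.0.1", 5000, "10.0.0.2", 80, 100),
    Packet("10.0.0.1", 5000, "10.0.0.2", 80, 100),
    Packet("10.0.0.1", 5000, "10.0.0.2", 80, 1500),
    Packet("10.0.0.1", 5001, "10.0.0.2", 80, 100),
    Packet("10.0.0.1", 5001, "10.0.0.2", 80, 1500),
]
vectors = [connection_features(s) for s in split_into_connections(demo).values()]
print(representative(vectors))  # [100.0, 0.625, 1500.0, 0.375]
```

Averaging is the combination rule named for this embodiment; any other way of summarizing the per-connection vectors would be a departure from it.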
In more detail, step 110 (traffic collection) uses an application traffic collection technique based on the idea of a network traffic filter. The application to be matched is run, and the application and the port numbers it uses are restricted so that only the packets of the desired application can pass through the network interface. The required traffic is then recorded at the network ingress and egress point with a traffic recording technique, for later analysis.

In step 120 (connection characterizing), the recorded traffic is classified by source IP, source port, destination IP, and destination port, and split into multiple connections. With each connection as the processing unit, the feature value of each connection is computed. This feature value is a vector comprising the Dominating Size (DS), the Dominating Size Proportion (DSP), and the Change Cycle (CC). The dominating sizes are the packet sizes that account for the larger shares of the packets in a connection, and the dominating size proportions are the corresponding shares. The change cycle serves as an auxiliary basis for identification when the packet sizes within a connection vary drastically.

In step 130 (computing the application representative feature values), once the feature values of the connections are available, a representative feature value for the class is derived from the connections that belong to the same session. In this embodiment, the feature values of the connections are averaged, and the resulting average is used as the representative feature value of that type of application.

Next, in the classification process of the second stage 200, the application representative feature values obtained in the training process of the first stage 100 serve as the baseline against which real network traffic is compared. The application to which captured packets belong is inferred from their distance to each representative feature value. The process includes the following steps. Step 205: real network traffic is taken in. Step 210, traffic splitting: the traffic is split into multiple connections, and the features of each connection are computed as in step 120 of the first stage. Step 220, Port Association Table (PAT): with each connection as the processing unit, <SrcIP, SrcPort> and <DstIP, DstPort> are used as indexes to search the PAT for existing information about the connection. If there is none, the flow proceeds to step 230, packet identification: the feature value of each connection is computed, its Euclidean distance to each application representative feature value obtained in the first stage is calculated, and the application whose representative feature value is closest is taken as the application of the connection. If the connection already has information in the PAT, its application can be determined directly from the PAT record.
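The lookup-then-match logic of the second stage can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the PAT is modeled as a dictionary keyed by (IP, port) endpoints, a hit on either endpoint is treated as sufficient, feature vectors are plain lists of floats, and a connection is reported as unknown when even the nearest representative is farther away than a cutoff `max_distance`. The function names and the cutoff value are hypothetical.

```python
import math

UNKNOWN = "unknown"


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(conn, features, representatives, pat, max_distance=50.0):
    """Classify one connection.

    conn: (src_ip, src_port, dst_ip, dst_port)
    features: feature vector of this connection
    representatives: {application name: representative feature vector}
    pat: port association table, {(ip, port): application name}
    max_distance: hypothetical cutoff beyond which the result is UNKNOWN
    """
    src = (conn[0], conn[1])
    dst = (conn[2], conn[3])

    # Step 220: an endpoint already recorded in the PAT decides directly.
    # (Assumption: a match on either endpoint is enough.)
    for endpoint in (src, dst):
        if endpoint in pat:
            return pat[endpoint]

    # Step 230: otherwise pick the representative with the smallest distance.
    best_app, best_dist = UNKNOWN, math.inf
    for app, rep in representatives.items():
        d = euclidean(features, rep)
        if d < best_dist:
            best_app, best_dist = app, d
    if best_dist > max_distance:
        return UNKNOWN

    # Record both endpoints so related connections hit the PAT next time.
    pat[src] = best_app
    pat[dst] = best_app
    return best_app


reps = {"app_a": [100.0, 0.6, 1500.0, 0.4], "app_b": [60.0, 0.9, 1200.0, 0.1]}
pat = {}
first = ("10.0.0.1", 5000, "10.0.0.2", 6881)
print(classify(first, [102.0, 0.6, 1498.0, 0.4], reps, pat))   # app_a
second = ("10.0.0.3", 4000, "10.0.0.2", 6881)
print(classify(second, [0.0, 0.0, 0.0, 0.0], reps, pat))       # app_a (PAT hit)
```

Note that packet sizes in bytes dominate the distance over proportions in the range 0 to 1; whether and how the components are scaled is not stated in the text, so the sketch leaves them unscaled.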
If the feature value computed for a connection differs so much from the set of application representative feature values obtained in the first stage 100 that the type of application cannot be determined, the connection may be judged an "unknown application". In the final step 240, application determination, the identified traffic is labeled either as a known application of some type or as an unknown application.

Fig. 2 is a schematic diagram of the classification process in the second stage 200 of the network traffic classification method according to one embodiment of the present invention; its steps are the same as those in Fig. 1. The classification process includes: step 205, taking in real packet traffic from the network; step 210, splitting the packet traffic into connections; and step 220, checking whether each connection already has information in the port association table. If it does, the flow goes to the final step 240, where the application is determined to be program A or program B. If the connection has no information in the port association table, the flow goes to step 230, packet identification, where the connection is determined to be program A, program B, or an unknown program.

In more detail, in step 210 (traffic splitting), after real traffic has been taken in from the network, the traffic to be analyzed is classified by source IP (SrcIP), source port (SrcPort), destination IP (DstIP), and destination port (DstPort), and split into multiple connections.

In step 220 (port association table), the <SrcIP, SrcPort> and <DstIP, DstPort> of the connection are first looked up in the Port Association Table (PAT). The PAT stores the connections that have already been identified, together with information on the sessions they belong to. With <SrcIP, DstIP, SrcPort, DstPort> denoting an identified connection, the following steps are performed:

1. Record that the hosts using this SrcIP and DstIP use the identified application.
2. Record the SrcPort and DstPort in the Port Association Table (PAT).
3. If some connection matches <SrcPort, SrcPort+1> or <DstPort, DstPort+1>, that connection is considered to belong to the same session.

In step 230 (packet identification), the feature value of each connection split out in step 210 is computed, and its Euclidean distance to each application representative feature value in the set obtained in the first stage is calculated. If the packet size distribution of a connection is similar to the representative feature value of some application, the Euclidean distance between them will be small, so the distance can be used to judge which application a connection most resembles. At the same time, session association analysis is performed on the identified connections: connections that belong to the same session are grouped together in order to obtain more comprehensive information.

The above describes the operation flow for matching a single application. If several applications need to be matched, the present invention only needs to run this flow once for each application.

In summary, the present invention is a method for classifying the application to which network traffic belongs. Feature values computed from the transport-layer behavior of an application are compared with known representative feature values, and port association is used to identify the other connections related to the application as well. The present invention addresses both the problem of identifying packets whose content cannot be inspected and the problem of applications that cannot be identified because they use dynamic port numbers, and it can serve as the identification mechanism of an inline gateway.

The embodiments described above merely illustrate the technical ideas and features of the present invention. Their purpose is to enable those skilled in the art to understand the content of the present invention and practice it accordingly; they shall not be used to limit the patent scope of the present invention. Equivalent changes or modifications made in accordance with the spirit disclosed by the present invention shall still fall within the patent scope of the present invention.

[Brief Description of the Drawings]

Fig. 1 shows the steps of the network traffic classification method according to one embodiment of the present invention.
Fig. 2 is a schematic diagram of the classification process of the network traffic classification method according to one embodiment of the present invention.

[Description of Main Reference Numerals]

100 First stage
200 Second stage
S110-S140 Steps of the training process
S205-S240 Steps of the classification process