Background
In recent years, with the popularization of the internet and the emergence of related technologies such as cloud computing and big data, the internet has entered a period of rapid development. This rapid development has caused the volume of data carried by network transmission services to grow quickly; in particular, the rise of short-video and live-streaming platforms has made network services more interactive and real-time, and end users place higher demands on the Quality of Service (QoS) of those services. However, with limited network resources, the continuous growth of internet traffic leads to problems such as sharply increasing bandwidth consumption, difficulty in guaranteeing quality of service, and growing security concerns. The traditional network architecture clearly struggles to meet such diversified user requirements. In view of the foregoing, the internet industry needs a new network architecture that addresses these problems and is more flexible and efficient than conventional architectures, so as to meet society's ever-increasing demand for traffic data.
SDN is a novel network architecture that has attracted wide attention from many fields and solves several problems that are unavoidable in traditional networks. In the traditional network architecture, each device independently makes forwarding decisions and exchanges information through a series of network protocols (such as TCP/IP). Under this architecture, the control and forwarding functions of a network device are tightly coupled: a device can only plan paths for traffic from its own local viewpoint and has no global view of network resources, which easily leads to problems such as link congestion. SDN separates forwarding from control; the control layer can obtain link information in real time through the OpenFlow protocol, which gives it a global view of network resources, facilitates centralized control, and allows resources to be managed and allocated uniformly according to business demands. At the same time, centralized control treats the whole network as a single entity, which simplifies maintenance. Compared with a traditional IP network, an SDN network alleviates problems such as inaccurate routing information and low routing efficiency, and lays a foundation for intelligent route planning according to the requirements of different flows. Research on the SDN architecture is therefore of great significance. Routing is an indispensable component of both traditional networks and SDN networks; however, the basic algorithm adopted by current mainstream SDN routing modules is the Dijkstra (shortest path) algorithm. If all data packets rely only on the shortest path, flows tend to select the same links and cause congestion there while other links remain idle, which greatly reduces link utilization. Moreover, the shortest path algorithm is a graph-theoretic algorithm that, when executed, actually computes the shortest paths from the source node to all other nodes in the topology, so its time complexity is high. There are also protocols that support multipath forwarding, such as ECMP, but these protocols do not take into account the quality-of-service requirements of different traffic flows. Therefore, a better routing strategy is needed in SDN networks to generate routes, improve network performance, and guarantee the service quality of different service flows.
Disclosure of Invention
The invention provides a high-bandwidth-utilization-oriented SDN intelligent route planning technique, aiming at the problem that the current SDN network mainly adopts the shortest path as the route planning algorithm, which leads to low link bandwidth utilization and related issues.
The technical scheme adopted by the invention for solving the technical problem is as follows: an SDN multi-path route planning method based on reinforcement learning comprises the following steps:
step 1: acquiring available bandwidth information, total bandwidth information, node information and link information of a network to construct a network topology matrix, and acquiring a characteristic matrix of a stream to be forwarded;
step 2: a QLearning algorithm is adopted as the reinforcement learning model, and the network topology matrix and the feature matrix of the flow to be forwarded from step 1 are input into the reinforcement learning model to train the Q-value table; the reward function R in the QLearning algorithm is as follows:
wherein: rt(Si,Aj) Indicating slave status S of a data packetiSelection action AjThe obtained reward is represented in a routing planning task as the reward generated when the next hop selected by the data packet at the node i is the node j, β is the flow QoS grade, η is the bandwidth utilization rate, d is the destination node, delta (j-d) is an impulse function and represents that when the next hop of the data packet is the destination node, the value is 1, T is the connection state of the network topology nodes, 1 when the two nodes are connected and 0 when the two nodes are not connected, and g (x) is) As a cost function, the following is shown:
in the formula ImIs the total number of links of the network topology. x is the hop count passed by the data packet in forwarding;
and step 3: a path Routing is obtained according to the Q-value table and put into the path set Routing(S, D), and it is judged whether the minimum available link bandwidth of the path is smaller than the bandwidth of the flow. If so, a slice of size B = Bavail · β / Σiβi is divided from the flow, wherein Bavail represents the minimum available link bandwidth of the current output path, β represents the QoS level of the current flow, and Σiβi represents the sum of the QoS levels of all flows. The divided slice is forwarded from the source node to the destination node through the current output path, and the remaining traffic is returned as a new flow to step 2 to train the Q-value table again; if not, the planning is finished, and the planned multi-path routes are obtained from Routing(S, D).
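For example (with purely illustrative numbers that are not part of the embodiment): if the minimum available link bandwidth of the current output path is Bavail = 90, the current flow has QoS level β = 3, and the QoS levels of all flows sum to Σiβi = 6, then the slice divided from the flow has size B = 90 × 3 / 6 = 45; this slice is forwarded over the current output path while the remaining traffic re-enters step 2 as a new flow.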
Further, the feature matrix of the flow to be forwarded includes the source address, destination address, QoS class and size of the flow.
Further, the process of training the Q-value table by the QLearning algorithm is specifically as follows:
and setting the maximum step number of the single training.
(1) Initializing a Q value table and a reward function R;
(2) an action epsilon-greedy strategy P is adopted, and an action a is selected;
(3) executing action a, transferring to a state s', calculating a reward value by using a reward function R, and updating a Q value table;
(4) and judging whether s' is a destination node or not. If not, let s ═ s', return to (2).
Further, the SDN multi-path route planning method based on reinforcement learning is characterized in that the cost function is defined such that the cost increases with the number of hops x traversed by the forwarded data packet, with g(x) ∈ (0,1), and the cost function should satisfy: the curve of the cost function g(x) is convex upward (it rises quickly at first and then flattens), and when the total number of hops of the data packet tends to infinity, the value of the cost function tends to 1.
The method has the advantages that reinforcement learning is applied to SDN multi-path route planning: a QLearning algorithm is used as the reinforcement learning model; according to the input network topology matrix and the feature matrix of the current flow to be forwarded, different reward functions are set for flows with different QoS levels and multiple paths are planned to forward the flows; and when the link bandwidth is insufficient, a large flow is divided into several small flows, thereby improving link bandwidth utilization.
Detailed Description
Aiming at the fact that existing SDN controllers adopt the Dijkstra algorithm as the shortest-path search algorithm, the invention applies reinforcement learning to SDN routing. Taking advantage of the separation of forwarding and control in SDN, the network topology environment is used directly for training the Q-value table. Considering that different services have different QoS requirements, the invention provides routes with different service qualities for different services; and when the link bandwidth is insufficient, a large flow is divided into several small flows, thereby improving link bandwidth utilization.
As shown in fig. 1, the present invention provides a reinforcement learning based SDN multi-path route planning method, which includes the following steps:
Step 1: available bandwidth information, total bandwidth information, node information and link information of the network are acquired to construct a network topology matrix, and the feature matrix of the flow to be forwarded is acquired. A network topology as shown in fig. 2 is built with Mininet, comprising 9 OpenFlow switches and 5 hosts. According to the multi-path route planning algorithm and the SDN network topology, the bandwidth of each network link is set to 200, and the sending ends and receiving ends are set to h1~h5; each sending end randomly sends data to the other receiving ends with a probability of 20%, and all hosts send 30 static flows in total, where a static flow is a flow that, once injected into the network, occupies link bandwidth until the end of the experiment.
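For reference, a minimal Mininet sketch at this scale is given below; the switch-to-switch link list, the host attachment points and the controller address are assumptions made only for illustration, since the exact topology of fig. 2 is not reproduced here (Mininet's TCLink interprets bw in Mbit/s).

```python
# Illustrative sketch only: builds a 9-switch / 5-host Mininet topology with
# links of bandwidth 200. The link list and host attachment points are
# assumptions; the real topology of fig. 2 may differ.
from mininet.net import Mininet
from mininet.topo import Topo
from mininet.link import TCLink
from mininet.node import RemoteController

class DemoTopo(Topo):
    def build(self):
        switches = [self.addSwitch('s%d' % i) for i in range(1, 10)]
        hosts = [self.addHost('h%d' % i) for i in range(1, 6)]
        # Hypothetical switch-to-switch links (not the patent's exact topology).
        sw_links = [(1, 2), (1, 4), (2, 3), (2, 5), (3, 6),
                    (4, 5), (4, 7), (5, 6), (5, 8), (6, 9), (7, 8), (8, 9)]
        for a, b in sw_links:
            self.addLink(switches[a - 1], switches[b - 1], cls=TCLink, bw=200)
        # Attach each host to an edge switch (again, an assumption).
        for i, sw in enumerate([1, 3, 5, 7, 9]):
            self.addLink(hosts[i], switches[sw - 1], cls=TCLink, bw=200)

if __name__ == '__main__':
    # Assumes an SDN controller (e.g. the one running the planner) listens locally.
    net = Mininet(topo=DemoTopo(), link=TCLink,
                  controller=lambda name: RemoteController(name, ip='127.0.0.1'))
    net.start()
    net.pingAll()
    net.stop()
```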
Step 2: as shown in fig. 1, the QLearning algorithm is used as the reinforcement learning model. The multi-path routing algorithm proposed by the present invention is modeled as a Markov decision process (MDP); accordingly, the MDP quadruple proposed by the present invention is defined as follows:
(1) State set: in the network topology, each switch represents a state; therefore, according to the network topology, the set of network states is defined as follows:
S=[s1,s2,s3,…s9]
wherein s1~s9 represent the 9 OpenFlow switches in the network. The source node information of a data packet indicates its initial state, and the destination node information indicates its termination state. When a data packet reaches the destination node, it reaches the termination state. Once the current data packet reaches the termination state, one round of training terminates, and the data packet returns to the initial state for the next round of training.
(2) Action space: in an SDN network, the transmission path of a data packet is determined by the network state, i.e. a data packet can only be transmitted between connected network nodes. According to the network topology, the network connection state is defined as T[si][sj] = 1 when nodes si and sj are directly connected, and T[si][sj] = 0 otherwise.
Since data packets can only be transmitted between connected network nodes, the action set of each state si ∈ S can be defined from the set of network states and the network connection state as follows:
A(si)={sj|T[si][sj]=1}
This indicates that the set of actions selectable in the current state si corresponds, on the network topology, to the nodes sj directly connected to si, i.e. the current state si can only select a state sj connected to it. For example, the action set of state s1 is A(s1) = {s2, s4}.
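As a minimal sketch of how the connection matrix and the action sets above can be represented, the code below uses a placeholder adjacency matrix chosen only so that A(s1) = {s2, s4} holds; it is not the full connection matrix of fig. 2.

```python
# Minimal sketch: derive the action set A(si) = {sj | T[si][sj] = 1} from a
# node-connection matrix T. The matrix entries below are placeholders that are
# consistent with the example A(s1) = {s2, s4}, not the full topology of fig. 2.
import numpy as np

NUM_SWITCHES = 9
T = np.zeros((NUM_SWITCHES, NUM_SWITCHES), dtype=int)

def connect(i, j):
    """Mark switches si and sj (1-indexed) as directly connected."""
    T[i - 1][j - 1] = T[j - 1][i - 1] = 1

connect(1, 2)
connect(1, 4)
# ... the remaining links of the topology would be added here ...

def actions(state):
    """Return the action set A(s_state): all next hops directly connected to it."""
    return [j + 1 for j in range(NUM_SWITCHES) if T[state - 1][j] == 1]

print(actions(1))  # -> [2, 4], i.e. A(s1) = {s2, s4}
```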
(3) State transition: in each round of training, when a data packet is in state si and selects an action, i.e. a next-hop state, the data packet moves to that next state. Another key issue in reinforcement learning is the generation of reward values: when the agent performs a state transition, the system feeds back a reward to the agent according to the reward function R.
(4) Reward: the final purpose of reinforcement-learning-based multi-path route planning is to plan reasonable multiple paths through training, so the setting of the reward value R is also important. Bandwidth utilization and delay are mainly considered here, where delay mainly refers to the hop count of a path; in order to plan different paths for flows with different QoS levels, the larger the traffic level β, the smaller the hop count of the planned path should be. Specifically, the reward function is designed with the following two requirements in mind:
1. The QoS level β and the link bandwidth utilization η need to be considered;
2. Flows with a large β are encouraged to be allocated paths with fewer hops;
in summary, the reward function formula designed herein is as follows:
wherein: rt(Si,Aj) Indicating slave status S of a data packetiSelection action AjThe obtained reward is represented in a routing planning task as the reward generated when the next hop selected by the data packet at the node i is the node j, β is the flow QoS grade, η is the bandwidth utilization rate, d is the destination node, delta (j-d) is an impulse function and represents that when the next hop of the data packet is the destination node, the value is 1, T is the connection state of the network topology nodes, the two nodes are 1 when connected and 0 when not connected, and the reward function represents that the data packet is in the state SiWhen the next hop that can be selected (connected) is j (action a)j) Then, from T [ S ]i][Aj]The bonus function when 1 yields a bonus value, otherwise the bonus value is set to-1.
g(x) is a cost function, defined such that the cost increases with the number of hops x traversed by the forwarded data packet, with g(x) ∈ (0,1), where lm is the total number of links in the network topology. Considering that when the network topology is large it is impractical to traverse all paths, and a data packet can only be forwarded over a subset of the links, the cost function should satisfy: it increases quickly in the early stage and becomes stable in the later stage, and when the total number of hops traversed by the data packet reaches lm, the cost function value is maximized. In summary, the cost function is as follows:
as shown in fig. 3, it can be seen that the cost function is an increasing function, the range of the increasing function is (0,1), the cost increases with the increase of the number of hops x, and the function grows more rapidly in the early stage and tends to be stable in the later stage, which meets the requirement of the cost function.
The second requirement in designing the reward function is to encourage flows with a large β to be allocated paths with a small hop count; therefore the cost function g(x) in the reward function is multiplied by the traffic QoS class β. Under the same conditions, the more hops a forwarded packet traverses, the larger the cost incurred by a flow with a high QoS class, so paths with a small hop count will preferentially be selected for flows with a high QoS class during planning.
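Purely to make the preceding description concrete, the sketch below combines the terms named above into one possible reward; the exact functional form of the reward and of the cost g(x) (here 1 − e^(−x/lm)) are assumptions, since only their qualitative properties are specified in the description.

```python
# Illustrative sketch of a reward built from the terms described above. The
# exact combination of terms and the concrete cost function are assumptions;
# the description only fixes their qualitative behaviour.
import math

L_M = 12                     # assumed total number of links lm, for illustration

def cost(x, l_m=L_M):
    """Assumed cost g(x) in (0,1): grows quickly early, then flattens toward 1."""
    return 1.0 - math.exp(-x / l_m)

def reward(i, j, T, beta, eta, d, hops):
    """Reward for choosing next hop j from node i.

    T    : node connection matrix (1 = connected)
    beta : QoS grade of the flow
    eta  : bandwidth utilisation term named in the description
    d    : destination node
    hops : hops already traversed by the packet
    """
    if T[i][j] != 1:
        return -1.0                          # unconnected next hop is penalised
    delta = 1.0 if j == d else 0.0           # impulse term delta(j - d)
    return beta * eta + delta - beta * cost(hops)
```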
After the MDP quadruple is determined, the Q-value table is trained using the QLearning algorithm; the specific steps are as follows.
The maximum number of steps for a single training episode is set.
(1) Initialize the Q-value table and the reward function R;
(2) select an action a according to the ε-greedy action strategy P;
(3) execute action a, transfer to state s', calculate the reward value using the reward function R, and update the Q-value table;
(4) judge whether s' is the destination node; if not, let s = s' and return to step (2).
As shown in fig. 4, the learning rate α is set to 0.8, the discount rate γ is set to 0.6, and the parameter ε of the ε-greedy action strategy is set to 0.1.
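For concreteness, a minimal training loop consistent with steps (1)–(4) and the hyperparameters above is sketched below; the episode count, the per-episode step limit and the reward() and actions() helpers are assumptions (the reward is expected to have the flow's β, η and the matrix T already bound in, e.g. via functools.partial).

```python
# Minimal QLearning training sketch following steps (1)-(4) above, with
# alpha = 0.8, gamma = 0.6 and epsilon = 0.1 as in fig. 4. The episode count,
# step limit and the reward()/actions() helpers are assumptions.
import random
import numpy as np

ALPHA, GAMMA, EPSILON = 0.8, 0.6, 0.1
NUM_STATES = 9        # one state per OpenFlow switch s1~s9
MAX_STEPS = 50        # assumed maximum number of steps for a single training episode

def train(episodes, src, dst, reward, actions):
    """Train Q[s][a] for packets travelling from src to dst (0-indexed switches)."""
    Q = np.zeros((NUM_STATES, NUM_STATES))
    for _ in range(episodes):
        s, hops = src, 0
        for _ in range(MAX_STEPS):
            acts = actions(s)                      # next hops connected to s
            if random.random() < EPSILON:          # epsilon-greedy exploration
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[s][x])
            hops += 1
            r = reward(s, a, d=dst, hops=hops)     # reward function R from above
            if a == dst:
                target = r                         # destination reached: no bootstrapping
            else:
                target = r + GAMMA * max(Q[a][x] for x in actions(a))
            Q[s][a] += ALPHA * (target - Q[s][a])  # Q-value table update
            s = a                                  # the chosen next hop becomes the state
            if s == dst:
                break                              # episode ends at the destination node
    return Q
```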
A path Routing is obtained according to the trained Q-value table and put into the path set Routing(S, D), and it is judged whether the minimum available link bandwidth of the path is smaller than the bandwidth of the flow. If so, a slice of size B = Bavail · β / Σiβi is divided from the flow, wherein Bavail represents the minimum available link bandwidth of the current output path, β represents the QoS level of the current flow, and Σiβi represents the sum of the QoS levels of all flows. This small flow is forwarded from the source node to the destination node through the current output path. The flow is then updated as flow = flow − B, wherein B represents the size of the traffic already planned; that is, the remaining traffic is returned as a new flow to step 2 for training of the Q-value table. If not, the planning is finished, and the planned multi-path routes are obtained from Routing(S, D).
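Tying the steps together, the sketch below shows one way the outer flow-splitting loop described above could be organised; the slice formula B = Bavail · β / Σiβi is taken from the description, while train_q_table(), greedy_path(), min_avail_bw() and reserve_bw() are assumed helpers passed in by the caller.

```python
# Sketch of the outer multi-path planning loop: train a Q-value table, extract a
# path, split the flow when the path cannot carry it, and repeat with the
# remainder. All helpers are parameters because their concrete implementations
# (controller APIs, Q-table training) are not shown here.

def plan_multipath(flow_size, beta, sum_beta, src, dst,
                   train_q_table, greedy_path, min_avail_bw, reserve_bw):
    """Return Routing(S, D) as a list of (path, allocated_size) pairs."""
    routing_set = []
    remaining = flow_size
    while remaining > 0:
        q_table = train_q_table(src, dst, beta)    # step 2: retrain for the current flow
        path = greedy_path(q_table, src, dst)      # follow max-Q next hops to dst
        b_avail = min_avail_bw(path)               # bottleneck bandwidth of this path
        if b_avail >= remaining:
            routing_set.append((path, remaining))  # the path can carry the whole flow
            remaining = 0
        else:
            slice_size = b_avail * beta / sum_beta  # B = Bavail * beta / sum_i(beta_i)
            routing_set.append((path, slice_size))
            reserve_bw(path, slice_size)            # assumed: reserve bandwidth on links
            remaining -= slice_size                 # the remainder becomes a new flow
    return routing_set
```

Retraining inside the loop mirrors the step above in which the remaining traffic is returned to step 2: once part of the flow has been placed, the available bandwidths in the topology change, so the previous Q-value table no longer reflects them.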
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.