CN112822718B - A packet transmission method and system driven by reinforcement learning and stream coding - Google Patents

A packet transmission method and system driven by reinforcement learning and stream coding

Info

Publication number
CN112822718B
Authority
CN
China
Prior art keywords
packet
sending
value
packets
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011620034.2A
Other languages
Chinese (zh)
Other versions
CN112822718A (en)
Inventor
张非凡
李业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University
Priority to CN202011620034.2A
Publication of CN112822718A
Application granted
Publication of CN112822718B
Legal status: Active (current)
Anticipated expiration

Abstract

Translated from Chinese



The invention discloses a packet transmission method and system driven by reinforcement learning and stream coding. The packet transmission method specifically includes the following steps: first, the stream coding parameters are initialized; then, based on feedback from the receiver, the sender estimates the current network congestion status and the receiver's in-order packet reception progress, uses this series of states as a feature vector for the model to learn in real time, and selects the current action according to a reward function, so that the sender's sending actions are trained online during packet transmission. The packet transmission system includes a sender, a receiver, a state space unit, a reward function unit, a value fitting unit, and an action selection unit. Based on the current network conditions and packet loss rate, the invention dynamically adjusts the packet sending interval and intelligently selects the type of packet to send, realizing the joint optimization of stream coding rate control and congestion control, improving network throughput, reducing data transmission delay, and adapting to changing link conditions.


Description

Packet transmission method and system driven by reinforcement learning and stream coding
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a packet transmission method and system driven by reinforcement learning and stream coding, especially one oriented to wireless links with a large delay-bandwidth product.
Background
The wireless long fat link, i.e., the wireless link with a large delay-bandwidth product, is an important component of the future air-space-ground integrated network. At present, the TCP (Transmission Control Protocol) conventionally relied upon suffers from low bandwidth utilization on long fat wireless links. Most TCP variants treat packet loss as a congestion signal and therefore reduce the sending rate; on wireless links, however, packet loss may be caused by random link errors rather than congestion, so this behaviour can lead to unnecessary slowdowns. In many new air-space-ground integrated network scenarios, link-layer automatic repeat request (ARQ) cannot be used because of the large propagation delay, so packet loss caused by link errors is unavoidable and the problem is particularly serious. Secondly, to avoid congestion, TCP increases its sending rate only gradually at the beginning of a transmission (slow start). On a long fat link, where both bandwidth and propagation delay are large, it may take a long time to fill the link with data; especially for short-lived connections carrying small amounts of data, this causes a severe drop in link bandwidth utilization.
A number of TCP congestion control variants have been proposed to address these problems; typical examples include TCP Westwood+ and Google's BBR. However, such rule-based congestion control schemes are not sufficient for the highly heterogeneous and dynamic characteristics of the future air-space-ground integrated network; future heterogeneous, large-scale wireless networks demand greater flexibility and more stringent throughput/delay requirements. Recently, the Quick UDP Internet Connections protocol (QUIC) proposed by Google has been widely regarded as an alternative to TCP for future network packet transmission. QUIC is built entirely on UDP: it exploits the connectionless nature of UDP to reduce TCP's three-way handshake delay for connection establishment, uses UDP's tolerance of out-of-order delivery to multiplex HTTP streams more efficiently, and its lightweight nature gives great flexibility for deployment.
However, for UDP-based transport to provide a reliable, in-order application interface like TCP, congestion control and reliability mechanisms must still be added. Current QUIC designs, however, still mainly adopt TCP's existing congestion control and retransmission mechanisms, so on long fat wireless links the original problems of TCP persist.
Disclosure of Invention
In view of the above, the present invention aims to provide a packet transmission method driven by reinforcement learning and stream coding, so as to solve the problem of low bandwidth utilization of packet transmission over long fat wireless links in existing TCP and QUIC technologies.
The invention provides a packet transmission method driven by reinforcement learning and stream coding, which comprises the following steps:
S1, setting stream coding parameters;
S2, a sending end sends a packet, wherein the packet is an uncoded source packet or a coded repair packet;
S3, the receiving end decodes and recovers the received packets and delivers them in order to an upper layer application, and simultaneously sends feedback information to the sending end, wherein the feedback information comprises the decoding progress, the number and type of the most recently received packet, the number of received source packets and the number of received repair packets;
S4, the sending end processes the feedback information, determines system state information, calculates reward and punishment values according to a reward function, estimates the available bandwidth of the link, determines the interval between sending actions according to the available link bandwidth, and then conducts reinforcement learning;
the reinforcement learning is executed based on a reinforcement learning model, and the reinforcement learning method comprises the following steps:
s41, outputting a value function after weight updating and the value of each sending action according to the system state information and the reward and punishment values;
s42, selecting an optimal sending action according to the value of each sending action, wherein the optimal sending action is used as the sending action with the maximum value in the current state;
the system state information comprises the ratio of the current packet round-trip delay to the minimum packet round-trip delay, the ratio of the current sending packet action number to the total action number, and the ratio of the current sending source packet number to the total packet number; the sending action is one of sending source packet, sending repair packet and abandoning sending; the reward function is determined according to an optimization objective of packet transmission that maximizes its throughput for each user stream while minimizing latency;
s43, the sending end realizes sending action according to the optimal sending action selected in the step S42;
and S5, repeating the steps S3 and S4 to realize congestion control and stream coding rate control.
Further, the repair packet is a linear combination of source packets that have been previously transmitted, as shown in the following equation:
c_k = Σ_{i=w_s}^{i_seq} g_{k,i}·s_i
wherein c_k denotes the repair packet numbered k, k = 0, 1, 2, 3, …; g_{k,i} are stream coding coefficients selected from a finite field; w_s is the number of the oldest source packet in the current transmit queue, the initial value of w_s is 0, and the value of w_s is continuously updated according to the feedback information; i_seq denotes the number of the most recently transmitted source packet.
Further, the reward function is defined piecewise as follows: R(s, a) takes a positive value if the utility function U_n increases after the action; a negative value, whose magnitude decreases as gp/inp approaches 1, if U_n decreases and RTT_ratio ≥ τ; and 0 otherwise;
wherein R(s, a) denotes the reward/penalty value when the system state information is s and the sending action is a; gp is the goodput, i.e. the number of ordered source packets received by the receiving end divided by the elapsed time; inp is the number of all packets sent by the sending end divided by the elapsed time; U_n is the utility function, U_n = log(gp) − δ·log(RTT), where RTT is a smoothed estimate of the minimum round-trip delay; RTT_ratio is the ratio of the currently smoothed RTT estimate to the minimum RTT; τ is a preset hyper-parameter.
Further, the value function is obtained through the following steps:
the system state information is mapped, by means of tile coding, into a feature vector containing only the discrete values 0 and 1, and the value function is then obtained by fitting a linear function of this feature vector in combination with the reward and punishment values.
Further, the selecting of an optimal sending action according to the value of each sending action specifically comprises: selecting the optimal sending action using an e-greedy strategy.
The invention also provides a packet transmission system based on reinforcement learning and stream coding driving, which comprises:
a transmitting end, configured to transmit a packet, where the packet is an uncoded source packet or a coded repair packet;
the receiving end is used for decoding and recovering the received packets and orderly transmitting the packets to an upper layer application, and simultaneously sending feedback information to the sending end, wherein the feedback information comprises decoding progress, the number and the type of the latest received packets, the number of the received source packets and the number of the received repair packets;
the state space unit is arranged at the sending end and used for processing the feedback information and determining system state information; the system state information comprises the ratio of the current packet round-trip delay to the minimum packet round-trip delay, the ratio of the current sending packet action number to the total action number, and the ratio of the current sending source packet number to the total packet number;
a reward function unit for calculating an output reward penalty value according to a reward function as shown below;
R(s, a) takes a positive value if the utility function U_n increases after the action; a negative value, whose magnitude decreases as gp/inp approaches 1, if U_n decreases and RTT_ratio ≥ τ; and 0 otherwise;
wherein R(s, a) denotes the reward/penalty value when the system state information is s and the sending action is a; gp is the goodput, i.e. the number of ordered source packets received by the receiving end divided by the elapsed time; inp is the number of all packets sent by the sending end divided by the elapsed time; U_n is the utility function, U_n = log(gp) − δ·log(RTT), where RTT is a smoothed estimate of the minimum round-trip delay; RTT_ratio is the ratio of the currently smoothed RTT estimate to the minimum RTT; τ is a preset hyper-parameter;
a value fitting unit, configured to map the system state information, by tile coding, into a feature vector containing only the discrete values 0 and 1, then obtain a value function by fitting a linear function of this feature vector in combination with the reward and punishment values, and output the value of each sending action;
and an action selection unit, configured to select, according to the value of each sending action output by the value fitting unit, the sending action with the greatest value using an e-greedy strategy, the selected action then being carried out by the sending end.
Compared with the prior art, the invention has the following beneficial effects:
1. On the one hand, the technical scheme of the invention adopts stream coding to recover from packet loss, providing a reliability mechanism for UDP; this gives higher throughput than a retransmission scheme and smaller decoding delay than a block code. On the other hand, based on a reinforcement learning model, the invention learns online from the current network conditions and packet loss rate, dynamically adjusts the packet sending interval and intelligently selects the type of packet to send, realizing the joint optimization of stream coding rate control (the proportion between the two actions of sending source packets and sending repair packets) and congestion control, thereby improving network throughput, reducing data transmission delay, and adapting to changing link conditions.
2. No large amount of sample data is needed: online training of the self-learning model requires only information from the external environment (the current network congestion condition and the receiving end's in-order packet reception progress), with little reliance on human experience or external data.
3. The sending end can learn and make decisions online according to the network conditions, making packet sending more intelligent.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be noted that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained by those skilled in the art without inventive exercise.
Fig. 1 is a block diagram of a packet transmission system according to the present invention.
Fig. 2 is a block diagram illustrating a structure of a reinforcement learning model in the packet transmission system according to the present invention.
Fig. 3 is a graph comparing throughput of the transmission method of the present invention with other methods.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be noted that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a packet transmission method driven by reinforcement learning and stream coding, which specifically comprises the following steps:
S1, setting stream coding parameters;
the stream coding parameters are seeds of a pseudo-random number generator used to obtain the stream coding coefficients.
S2, a sending end sends a packet, wherein the packet is an uncoded source packet or a coded repair packet;
s3, the receiving end decodes and recovers the received packets and orderly transmits the packets to an upper layer application, and simultaneously sends feedback information to the sending end, wherein the feedback information comprises decoding progress, the number and the type of the latest received packets, the number of the received source packets and the number of the received repair packets;
the transmitting end may send two packets, one being an uncoded source packet and the other being a coded repair packet. Let iseqNumber indicating the most recently transmitted uncoded source packet, initialize iseq-1, i after each transmission of a source packetseqAnd adding 1. The repair packet is represented as
Figure GDA0003245036390000031
Which is a linear combination of source packets that have been previously transmitted. In the formula (1), ckDenotes a repair packet numbered k, gk,iIs from a finite field
Figure GDA0003245036390000032
Where k is 0, and 1,2 … is the number of the repair packet. w is asCorresponding to the number of the oldest (old) source packet in the current transmit queue. Initialization wsAt 0, the original packet acknowledged as received will be removed from the queue according to the feedback from the receiving end, at which time wsAn update will be made. Let we=iseq,[ws,we]Referred to as the coding window of the current repair packet.
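A small sketch of equation (1): the repair packet is formed by multiplying each source packet in the coding window by its coefficient and summing in the field. GF(2^8) with the reduction polynomial 0x11B is an assumption made for illustration; the patent only specifies a generic finite field.

```python
# Minimal sketch of c_k = sum_i g_{k,i} * s_i over an assumed GF(2^8).

def gf256_mul(a: int, b: int) -> int:
    """Carry-less multiply modulo the polynomial x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def make_repair_packet(sources: dict[int, bytes], coeffs: dict[int, int]) -> bytes:
    """Linear combination of the source packets in the coding window [w_s, i_seq]."""
    size = max(len(s) for s in sources.values())
    out = bytearray(size)
    for i, g in coeffs.items():
        for pos, byte in enumerate(sources[i]):
            out[pos] ^= gf256_mul(g, byte)   # addition in GF(2^8) is XOR
    return bytes(out)

# Example with a coding window of three already-sent source packets s_5..s_7
sources = {5: b"hello", 6: b"world", 7: b"!!!!!"}
repair = make_repair_packet(sources, {5: 3, 6: 7, 7: 11})
print(repair.hex())
```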
The receiving end decodes and recovers the received packets and delivers them in order to the upper-layer application. Let i_ord denote the number of the latest in-order delivered packet, initialized to i_ord = −1; the decoder's initial state is the ordered state. If the next packet received by the decoder is neither the source packet numbered i_ord + 1 nor a repair packet whose coding-window upper bound satisfies w_e = i_ord + 1, in-order delivery is interrupted and the decoder enters an out-of-order state, in which it buffers received packets and attempts decoding. The buffered packets are out-of-order source packets (numbered greater than i_ord + 1) or repair packets (with w_e > i_ord + 1). Let w̄_e denote the maximum coding-window upper bound among the buffered repair packets; the interval [i_ord + 1, w̄_e] is referred to as the decoder's current decoding window. As more packets are buffered, this window may expand (i.e., w̄_e grows). The decoder decodes by Gaussian elimination, i.e., it dynamically constructs a linear system of equations AS = B and performs forward elimination online, where the rows of A and B are, respectively, the coding coefficients of the buffered packets (out-of-order source packets are treated as special repair packets whose coefficient vector has a single non-zero element 1) and the coded information symbols. When decoding succeeds, all decoded source packets in the decoding window are delivered to the upper-layer application, the decoder returns to the ordered state, i_ord is updated to the upper bound of the decoding window, and the process restarts.
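The sketch below illustrates only the elimination step AS = B over an assumed GF(2^8): buffered repair packets contribute coefficient rows, an out-of-order source packet contributes a unit row, and the system is solved once it covers the decoding window. It is a simplified illustration, not the patented decoder.

```python
# Minimal sketch: Gaussian elimination over GF(2^8) (polynomial 0x11B assumed).

def gf_mul(a, b):
    p = 0
    while b:
        if b & 1: p ^= a
        a <<= 1
        if a & 0x100: a ^= 0x11B
        b >>= 1
    return p

def gf_inv(a):
    # brute-force inverse is fine for a sketch over a 256-element field
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def gaussian_decode(A, B):
    """A: n x m coefficient rows, B: n coded symbols (ints). Returns the m
    recovered source symbols, or None if the system is not yet solvable."""
    n, m = len(A), len(A[0])
    A = [row[:] for row in A]; B = B[:]
    row = 0
    for col in range(m):
        piv = next((r for r in range(row, n) if A[r][col]), None)
        if piv is None:
            return None                      # decoding window not yet covered
        A[row], A[piv] = A[piv], A[row]; B[row], B[piv] = B[piv], B[row]
        inv = gf_inv(A[row][col])
        A[row] = [gf_mul(inv, x) for x in A[row]]; B[row] = gf_mul(inv, B[row])
        for r in range(n):
            if r != row and A[r][col]:
                f = A[r][col]
                A[r] = [x ^ gf_mul(f, y) for x, y in zip(A[r], A[row])]
                B[r] ^= gf_mul(f, B[row])
        row += 1
    return B[:m]

# Example: two repair packets plus one out-of-order source packet covering a
# 3-packet decoding window (one byte per packet for brevity).
A = [[3, 7, 11],        # repair packet coefficient rows
     [5, 1, 9],
     [0, 1, 0]]         # out-of-order source packet = unit coefficient row
B = [gf_mul(3, 0x68) ^ gf_mul(7, 0x69) ^ gf_mul(11, 0x21),
     gf_mul(5, 0x68) ^ gf_mul(1, 0x69) ^ gf_mul(9, 0x21),
     0x69]
print([hex(x) for x in gaussian_decode(A, B)])   # recovers 0x68, 0x69, 0x21
```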
S4: the sending end processes the feedback information, determines the system state information, calculates the reward and punishment values according to the reward function, estimates the available bandwidth of the link, determines the interval between its sending actions according to the available link bandwidth, and then executes the learning process based on the reinforcement learning model.
In the invention, the system state information characterizes the network condition and specifically comprises the ratio of the current packet round-trip delay to the minimum packet round-trip delay, the ratio of the number of packet-sending actions to the total number of actions, and the ratio of the number of sent source packets to the total number of packets. The sending action is one of sending a source packet, sending a repair packet, or abstaining from sending (backoff). The interval between sending actions is set to two-thirds of the packet size divided by the available link bandwidth.
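The sketch below assembles the three state components and the send-action interval just described. The field names, the bandwidth estimate, and the assumption that packet size is measured in bytes and bandwidth in bit/s are illustrative.

```python
# Minimal sketch of building the state vector and the send-action interval.
from dataclasses import dataclass

@dataclass
class SenderStats:
    rtt: float            # current smoothed packet round-trip delay (s)
    rtt_min: float        # minimum observed round-trip delay (s)
    send_actions: int     # actions that actually sent a packet (source or repair)
    total_actions: int    # all actions, including "abstain from sending" (backoff)
    source_packets: int   # source packets sent so far
    total_packets: int    # source + repair packets sent so far

def state_vector(st: SenderStats) -> tuple[float, float, float]:
    return (st.rtt / st.rtt_min,
            st.send_actions / max(st.total_actions, 1),
            st.source_packets / max(st.total_packets, 1))

def action_interval(packet_size_bytes: int, bandwidth_bps: float) -> float:
    """Interval between send actions: 2/3 of one packet's transmit time."""
    return (2 / 3) * (packet_size_bytes * 8) / bandwidth_bps

st = SenderStats(rtt=0.120, rtt_min=0.100, send_actions=90,
                 total_actions=100, source_packets=80, total_packets=90)
print(state_vector(st))
print(action_interval(1200, 20e6))   # ~0.32 ms on a 20 Mbps link
```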
In the invention, the reward function is determined according to the optimization objective of packet transmission, which is that each user flow maximizes its throughput while minimizing the delay as much as possible. Specifically, the embodiment of the present invention designs the reward function piecewise as follows: R(s, a) takes a positive value if the utility function U_n increases after the action; a negative value, whose magnitude decreases as gp/inp approaches 1, if U_n decreases and RTT_ratio ≥ τ; and 0 otherwise.
Here, R(s, a) denotes the reward/penalty value when the system state information is s and the sending action is a; gp is the goodput, i.e. the number of ordered source packets received by the receiving end divided by the transmission time used so far; inp is the number of all packets sent by the sending end divided by the transmission time used so far; U_n is the utility function, U_n = log(gp) − δ·log(RTT), where RTT is a smoothed estimate of the minimum round-trip delay; RTT_ratio is the ratio of the currently smoothed RTT estimate to the minimum RTT; τ is a preset hyper-parameter, set to 1.2 in the embodiment of the present invention. The function thus emphasizes that each user flow should maximize its throughput while minimizing delay, and the log function ensures that the network allocates bandwidth resources fairly when multiple users compete for the same bottleneck link.
After an action, if the utility function value increases, a positive reward value is obtained. If the utility function value decreases and RTT_ratio ≥ τ (which indicates congestion), the reward/penalty value is negative, and the closer gp/inp is to 1, the smaller its magnitude. In all other cases, the reward/penalty value is zero.
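The exact reward expression appears only as an equation image in the original text; the sketch below reproduces the piecewise behaviour just described, with illustrative magnitudes. The value of δ and the concrete positive/negative magnitudes are assumptions, not the patented formula.

```python
# Sketch of the reward behaviour described above (magnitudes are assumptions).
import math

TAU = 1.2     # preset hyper-parameter tau from the embodiment
DELTA = 1.0   # weight of the delay term in the utility (assumed)

def utility(gp: float, rtt: float) -> float:
    return math.log(gp) - DELTA * math.log(rtt)

def reward(u_prev: float, u_now: float, gp: float, inp: float, rtt_ratio: float) -> float:
    if u_now > u_prev:                       # utility improved: positive reward
        return 1.0
    if u_now < u_prev and rtt_ratio >= TAU:  # utility dropped under congestion:
        return gp / inp - 1.0                # negative, closer to 0 as gp/inp -> 1
    return 0.0                               # all other cases

u0 = utility(gp=900.0, rtt=0.12)
u1 = utility(gp=850.0, rtt=0.15)
print(reward(u0, u1, gp=850.0, inp=1000.0, rtt_ratio=1.25))   # ≈ -0.15
```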
The continuous-valued system state is mapped, by means of tile coding, into a feature vector containing only the discrete values 0 and 1. A value function reflecting the value of each sending action is then fitted as a linear function of this feature vector. The learning process of reinforcement learning consists in obtaining the weights of the value function for each sending action.
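A sketch of this value-fitting step: tile coding turns the continuous state into a sparse binary feature vector, and each of the three sending actions has its own linear weight vector. The number of tilings and tiles and the one-step semi-gradient update below are assumptions; the patent states only that per-action value-function weights are learned from the reward.

```python
# Sketch: tile coding + per-action linear value function (update rule assumed).
import numpy as np

N_TILINGS, TILES_PER_DIM, N_DIMS = 8, 8, 3
ACTIONS = ("send_source", "send_repair", "backoff")
FEAT_DIM = N_TILINGS * TILES_PER_DIM ** N_DIMS

def tile_code(state: np.ndarray, lows: np.ndarray, highs: np.ndarray) -> np.ndarray:
    """Binary feature vector: one active tile per tiling, each tiling offset slightly."""
    x = np.zeros(FEAT_DIM)
    scaled = (state - lows) / (highs - lows)            # normalise to [0, 1)
    for t in range(N_TILINGS):
        shifted = np.clip(scaled + t / (N_TILINGS * TILES_PER_DIM), 0, 0.9999)
        idx = (shifted * TILES_PER_DIM).astype(int)     # tile index per dimension
        flat = t * TILES_PER_DIM ** N_DIMS + int(np.ravel_multi_index(tuple(idx), (TILES_PER_DIM,) * N_DIMS))
        x[flat] = 1.0
    return x

weights = {a: np.zeros(FEAT_DIM) for a in ACTIONS}

def q_value(x: np.ndarray, action: str) -> float:
    return float(weights[action] @ x)                   # linear value function

def update(x, action, r, x_next, alpha=0.1, gamma=0.95):
    target = r + gamma * max(q_value(x_next, a) for a in ACTIONS)
    weights[action] += alpha * (target - q_value(x, action)) * x

lows, highs = np.array([1.0, 0.0, 0.0]), np.array([3.0, 1.0, 1.0])
x = tile_code(np.array([1.2, 0.9, 0.8]), lows, highs)
x2 = tile_code(np.array([1.3, 0.9, 0.75]), lows, highs)
update(x, "send_source", r=1.0, x_next=x2)
print(q_value(x, "send_source"))   # ≈ 0.8 after one update
```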
Specifically, the reinforcement learning process of the invention comprises the following steps:
s41, outputting a value function after weight updating and the value of each sending action according to the system state information and the reward and punishment values;
s42, selecting the sending action with the maximum value (namely the optimal sending action) in the current state according to the value of each sending action;
and S43, the sending end realizes the sending action according to the optimal sending action selected in the step S42.
Specifically, when a sending-action instant arrives, the sending end first decides whether to send a packet at all; if it decides to send, it further decides whether to send a new source packet or to generate a repair packet from the source packets that have already been transmitted.
In the embodiment of the invention, the optimal sending action is selected using an e-greedy strategy, which works as follows: if a randomly drawn number is below the exploration probability e, an action is selected at random; otherwise, the action with the highest value in the current state is selected. Selecting actions according to the e-greedy strategy realizes the joint optimization of stream coding rate control and congestion control; the new source packets or repair packets to be sent are placed, in order, into the UDP transmission buffer for sending.
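A minimal sketch of this action-selection step, assuming an exploration probability of 0.1 and placeholder packet payloads; the value lookup would come from the fitting step described earlier.

```python
# Sketch of e-greedy action selection and the resulting send decision.
import random

ACTIONS = ("send_source", "send_repair", "backoff")

def epsilon_greedy(q_values: dict[str, float], epsilon: float = 0.1) -> str:
    if random.random() < epsilon:
        return random.choice(ACTIONS)                 # explore
    return max(q_values, key=q_values.get)           # exploit: highest-valued action

def on_send_tick(q_values: dict[str, float], udp_buffer: list) -> None:
    action = epsilon_greedy(q_values)
    if action == "send_source":
        udp_buffer.append("new source packet")        # next uncoded packet
    elif action == "send_repair":
        udp_buffer.append("repair packet")            # coded over already-sent sources
    # "backoff": abstain from sending at this tick

buf: list = []
on_send_tick({"send_source": 0.7, "send_repair": 0.4, "backoff": -0.1}, buf)
print(buf)
```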
S5: steps S3 and S4 are continuously repeated; according to the current network conditions and packet loss rate, the packet sending interval is dynamically adjusted and the type of packet to send is intelligently selected, realizing the joint optimization of stream coding rate control and congestion control.
The code rate of the stream coding is determined by the proportion between the two actions of sending source packets and sending repair packets: R = a/(a + b), where a is the number of transmitted source packets and b is the number of transmitted repair packets. The technical scheme of the invention therefore controls the code rate by controlling the proportion of source-packet and repair-packet sending actions.
As shown in fig. 1, the present invention further provides a packet transmission system based on reinforcement learning and stream coding driving, where the packet transmission system includes a transmitting end, a receiving end, a state space unit, a reward function unit, a value fitting unit, and an action selection unit. Wherein the state space unit, the reward function unit, the value fitting unit and the action selection unit constitute a reinforcement learning model as shown in fig. 2.
The system comprises a sending end and a receiving end, wherein the sending end is provided with an encoder, and the encoder sends an uncoded source packet or a coded repair packet;
and the receiving end is used for decoding and recovering the received packets and orderly transmitting the packets to the upper layer application, and simultaneously sending feedback information to the sending end, wherein the feedback information comprises the decoding progress, the number and the type of the latest received packets, the number of the received source packets and the number of the received repair packets.
The state space unit is arranged at the sending end and used for processing the feedback information sent by the receiving end and determining the system state information; the system state information includes the ratio of the current packet round trip delay to the minimum packet round trip delay, the ratio of the number of currently transmitted packet actions to the total number of actions, and the ratio of the number of currently transmitted source packets to the total number of packets.
And the reward function unit is used for calculating an output reward punishment value according to the reward function.
And the value fitting unit is used for mapping the system state information into a feature vector only containing discrete values 0 and 1 in a tile coding mode, then fitting the feature vector in a linear function form by combining the reward and punishment values to obtain a value function, and outputting the value of each sending action.
And the action selection unit is used for selecting the sending action with the maximum value by adopting an e-greedy strategy according to the value of each sending action output by the value fitting unit and sending the sending action by the sending end. And the new packets or the repair packets to be sent are sequentially stored in the UDP transmission buffer to be sent.
In specific application, the sending end sends stream-coded packets, the receiving-end decoder decodes them, and the decoding and reception progress information together with congestion indicators are continuously fed back to the sending end. The sending end abstracts the state information from this feedback, calculates a reward value according to the reward function, inputs the reward value and the state information into the value fitting unit to obtain the value of each action and update the fitting parameters, and finally selects the optimal action through the action selection unit. The reinforcement learning process is thus a continuous, feedback-driven iteration in which the agent's value-fitting function is updated; the model keeps learning as packet transmission proceeds, realizing the joint optimization of congestion control and stream coding rate control. Network simulation results show that, under wireless long fat link conditions, the throughput obtained by the method of the invention is far better than that obtained by other methods. As shown in fig. 3, the goodput (gp) obtained by the scheme of the present invention on a long fat wireless link with 1% packet loss rate, 100 ms delay and 20 Mbps bandwidth is much higher than that of existing schemes such as QUIC, TCP BBR and TCP CUBIC.
Although the present invention has been described in terms of the preferred embodiment, it is not intended that the invention be limited to the embodiment. Any equivalent changes or modifications made without departing from the spirit and scope of the present invention also belong to the protection scope of the present invention. The scope of the invention should therefore be determined with reference to the appended claims.

Claims (5)

Translated from Chinese

1. A packet transmission method driven by reinforcement learning and stream coding, characterized by comprising the following steps:
S1, setting stream coding parameters;
S2, a sending end sends a packet, the packet being an uncoded source packet or an encoded repair packet;
S3, a receiving end decodes and recovers the received packets and delivers them in order to an upper-layer application, and at the same time sends feedback information to the sending end, the feedback information comprising the decoding progress, the number and type of the most recently received packet, the number of received source packets and the number of received repair packets;
S4, the sending end processes the feedback information, determines system state information, calculates a reward/penalty value according to a reward function, estimates the available link bandwidth, determines the interval between its sending actions according to the available link bandwidth, and then performs reinforcement learning;
the reinforcement learning is executed based on a reinforcement learning model and comprises the following steps:
S41, according to the system state information and the reward/penalty value, outputting the value function with updated weights and the value of each sending action;
S42, selecting an optimal sending action according to the value of each sending action, the optimal sending action being the sending action with the greatest value in the current state;
wherein the system state information comprises the ratio of the current packet round-trip delay to the minimum packet round-trip delay, the ratio of the number of packet-sending actions to the total number of actions, and the ratio of the number of sent source packets to the total number of packets; the sending action is one of sending a source packet, sending a repair packet and abstaining from sending; the reward function is determined according to the optimization objective of packet transmission, namely that each user flow maximizes its throughput while minimizing the delay;
S43, the sending end carries out the sending action according to the optimal sending action selected in step S42;
S5, repeating steps S3 and S4 to realize congestion control and stream coding rate control.
2. The packet transmission method according to claim 1, wherein the repair packet is a linear combination of previously sent source packets s_i, as shown in the following formula:
c_k = Σ_{i=w_s}^{i_seq} g_{k,i}·s_i
wherein c_k denotes the repair packet numbered k, k = 0, 1, 2, 3, …; g_{k,i} are stream coding coefficients selected from a finite field; w_s is the number of the oldest source packet in the current sending queue, the initial value of w_s is 0, and the value of w_s is continuously updated according to the feedback information; i_seq denotes the number of the most recently sent source packet.
3. The packet transmission method according to claim 1, wherein the reward function R(s, a) takes a positive value if the utility function U_n increases after the action; a negative value, whose magnitude decreases as gp/inp approaches 1, if U_n decreases and RTT_ratio ≥ τ; and 0 otherwise;
wherein R(s, a) denotes the reward/penalty value when the system state information is s and the sending action is a; gp is the goodput, i.e. the number of ordered source packets received by the receiving end divided by the elapsed time; inp is the number of all packets sent by the sending end divided by the elapsed time; U_n is the utility function, U_n = log(gp) − δ·log(RTT), RTT being a smoothed estimate of the minimum round-trip delay; RTT_ratio is the ratio of the currently smoothed RTT estimate to the minimum RTT; τ is a preset hyper-parameter.
4. The packet transmission method according to claim 1, wherein the value function is obtained by the following steps: mapping the system state information, by tile coding, into a feature vector containing only the discrete values 0 and 1, and then fitting a linear function of this feature vector, in combination with the reward/penalty values, to obtain the value function.
5. The packet transmission method according to claim 1, wherein selecting the optimal sending action according to the value of each sending action specifically comprises: selecting the optimal sending action using an e-greedy strategy.
CN202011620034.2A | 2020-12-31 | A packet transmission method and system driven by reinforcement learning and stream coding | Active | CN112822718B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011620034.2A | 2020-12-31 | 2020-12-31 | A packet transmission method and system driven by reinforcement learning and stream coding

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011620034.2A | 2020-12-31 | 2020-12-31 | A packet transmission method and system driven by reinforcement learning and stream coding

Publications (2)

Publication Number | Publication Date
CN112822718A | 2021-05-18
CN112822718B | 2021-10-12

Family

ID=75855909

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011620034.2A | A packet transmission method and system driven by reinforcement learning and stream coding (Active, granted as CN112822718B) | 2020-12-31 | 2020-12-31

Country Status (1)

Country | Link
CN (1) | CN112822718B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN114666831B (en)* | 2022-03-25 | 2024-11-15 | 南通先进通信技术研究院有限公司 | A packet transmission method and system based on stream coding and bandwidth estimation drive
CN115694736B (en)* | 2022-10-28 | 2024-08-20 | 南通大学 | TCP performance enhancement proxy network application program method based on forward erasure coding

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110581808A (en)* | 2019-08-22 | 2019-12-17 | 武汉大学 | A congestion control method and system based on deep reinforcement learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101599965B (en)* | 2009-07-02 | 2012-01-25 | 电子科技大学 | Self-adaption high-speed information transmission method based on measurement
CN102137023B (en)* | 2011-04-14 | 2014-01-29 | 中国人民解放军空军工程大学 | Multicast congestion control method based on available bandwidth prediction
US8793557B2 (en)* | 2011-05-19 | 2014-07-29 | Cambridge Silicon Radio Limited | Method and apparatus for real-time multidimensional adaptation of an audio coding system
CN109217977A (en)* | 2017-06-30 | 2019-01-15 | 株式会社Ntt都科摩 | Data transmission method for uplink, device and storage medium
CN107911242A (en)* | 2017-11-15 | 2018-04-13 | 北京工业大学 | A kind of cognitive radio based on industry wireless network and edge calculations method
CN110958078B (en)* | 2019-11-01 | 2022-06-24 | 南通先进通信技术研究院有限公司 | Low-delay stream code packet transmission method for high-loss link


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on HTTP adaptive streaming bitrate control method based on Q-learning; 熊丽荣 et al.; 《通信学报》 (Journal on Communications); 2017-09-25 (No. 09); full text *

Also Published As

Publication number | Publication date
CN112822718A (en) | 2021-05-18

Similar Documents

Publication | Title
WO2012174763A1 (en) | TCP-based adaptive network control transmission method and system
US10834368B2 (en) | Kind of partially reliable transmission method based on hidden Markov model
CN107171842B (en) | Multipath transmission protocol congestion control method based on reinforcement learning
JP4016387B2 (en) | Data flow control method
US6934251B2 (en) | Packet size control technique
US6097697A (en) | Congestion control
CN111314022B (en) | Screen updating transmission method based on reinforcement learning and fountain codes
CN112822718B (en) | A packet transmission method and system driven by reinforcement learning and stream coding
CN114666831B (en) | A packet transmission method and system based on stream coding and bandwidth estimation drive
CN100502349C (en) | Method for increasing receiver-initiated sending rate of data packets
JP5009009B2 (en) | Method and apparatus for controlling parameters of wireless data streaming system
CN110855400A (en) | Error-correcting code-based adaptive packet loss recovery method, computing device and storage medium
CN113162850A (en) | Artificial intelligence-based heterogeneous network multi-path scheduling method and system
CN102291226A (en) | Self-adaptive network transmission control method and system based on TCP (Transmission Control Protocol) protocol
CN118400337B (en) | A link aggregation node caching method for differentiated QoS assurance
Jarvinen et al. | FASOR retransmission timeout and congestion control mechanism for CoAP
CN106027208A (en) | Feedback-based network code TCP (Transmission Control Protocol) decoding method and device
Yu et al. | Low-delay transmission for non-terrestrial networks based on FEC and reinforcement learning
CN113541885B (en) | Transmission performance protection method and system
CN117040685A (en) | Low-delay packet transmission method and system based on reinforcement learning and stream coding driving
CN116760777B (en) | Multipath congestion control method based on ABEA3C
CN113852569A (en) | Data packet coding and real-time transmission method using lag feedback information
CN113347114A (en) | Real-time streaming media transmission control method and device facing deadline sensing
KR100419280B1 (en) | Indirect acknowledgement method in snoop protocol in accordance with the status of a wireless link and packet transmission apparatus in an integrated network using the same method
EP3031159A1 (en) | Retransmission control network node and related method

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
