Systems and Methods for a Transport Coordination Service

TECHNICAL FIELD

The present disclosure pertains to the field of network management and telecommunication technology, and in particular to systems and methods for a transport coordination service (TCS).
BACKGROUND

As data center networks’ bandwidth-delay product increases and applications move to nano services (with many small flows), managing flows in the network is becoming more challenging. The current transmission control protocol (TCP)/internet protocol (IP) stack faces fundamental limitations in keeping up with the rapid growth of modern networks and applications. First, the current TCP/IP stack lacks the ability to accurately estimate the network state under dynamic network settings. Second, the current stack is not flexible enough to be extended easily. Any change to the TCP/IP layer may add significant complexity to the kernel stack and slow down packet processing.
Therefore, there is a need for systems and methods for a TCS that obviates or mitigates one or more limitations of the prior art.
This background information is provided to reveal information believed by the applicant to be of possible relevance to the present invention. No admission is necessarily intended, nor should it be construed, that any of the preceding information constitutes prior art against the present invention.
SUMMARY

Apparatus, methods and systems for a TCS may be provided according to one or more aspects. According to an aspect, a method for managing flows may be provided. The method may include detecting, by a node of a plurality of nodes of a network, a lifecycle event associated with a first flow. The method may further include managing, by the node, a local cache based on the lifecycle event. The local cache may be configured to store state information of one or more groups of flows. Each group of the one or more groups of flows may include one or more flows associated based on at least one of: sharing a common path in the network and being directed toward a same destination address. The state information for each said group may include one or more metrics based on transport layer information of said group of flows.
The lifecycle event may indicate an arrival of the first flow at the node. The first flow may be destined to a first destination address. Managing, by the node, the local cache based on the lifecycle event may include obtaining, by the node, state information of a first group of flows of the one or more groups of flows. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward the first destination address. The method may further include sending, by the node, data of the first flow at a rate determined based on the obtained state information.
Obtaining, by the node, state information of the first group of flows may include retrieving, by the node from the local cache, the state information of the first group of flows.
Obtaining, by the node, state information of the first group of flows may include sending, by the node to a second node of a second network, a request message requesting the state information of the first group of flows. The second network may be the same as or different from the network. Obtaining, by the node, state information of the first group of flows may further include receiving, by the node from the second node, a response message including the state information of the first group of flows.
The method may further include sending, by the node to a second node of a second network, a request message requesting the state information of the first group of flows. The second network may be the same as or different from the network. The method may further include receiving, by the node from the second node, a response message including the state information of the first group of flows. The method may further include updating, by the node, the local cache based on the received state information of the first group of flows.
The method may further include determining, by the node, that the state information of the first group of flows in the local cache is unreliable based on one or more of: the local cache being empty of the state information of the first group of flows; the state information of the first group of flows at the local cache being stale; and the state information of the first group of flows indicating a value that is substantially different from an expected value.
The lifecycle event may indicate a departure of the first flow from the node. Managing, by the node, the local cache based on the lifecycle event may include updating, by the node, the local cache based on the lifecycle event. Managing, by the node, the local cache based on the lifecycle event may further include sending, by the node to a second node of a second network, a cache update message indicating that the first flow has departed from the node. The second network may be the same as or different from the network.
The lifecycle event may indicate a transmission impairment event of the first flow. Managing, by the node, the local cache based on the lifecycle event may include sending, by the node to a second node of a second network, a cache update message indicating the transmission impairment event, wherein the second network is the same as or different from the network. The transmission impairment may be one or more of: a congestion notification signal, a delay, and a packet drop.
The lifecycle event may be a passage of time since the state information of a first group of flows of the one or more groups of flows was last updated. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward a destination address. Managing, by the node, a local cache based on the lifecycle event may include sending, by the node to a second node of a second network, a request message for updated state information of the first group of flows. The second network may be the same as or different from the network. Managing, by the node, a local cache based on the lifecycle event may include receiving, by the node from the second node, a response message including the updated state information of the first group of flows.
The lifecycle event may be a passage of time since a last cache update message was sent to a second node of a second network. The last cache update message may have been sent at a last update time and include the state information of a first group of flows of the one or more groups of flows in the local cache at the last update time. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward a destination address, wherein the second network is the same as or different from the network. Managing, by the node, a local cache based on the lifecycle event may include sending, by the node to the second node at a second time after the last update time, an updated cache update message including state information of the first group of flows in the local cache at the second time.
Detecting, by the node of the network, the lifecycle event associated with the first flow may include receiving, by the node from a second node of a second network, a request message for state information of a first group of flows of the one or more groups. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward a destination address. The second network may be the same as or different from the network. Managing, by the node, a local cache based on the lifecycle event may include retrieving, by the node from the local cache, the state information of the first group of flows. The method may further include sending, by the node to the second node, a response message including the retrieved state information of the first group of flows.
Detecting, by the node of the network, the lifecycle event associated with the first flow may include receiving, by the node from a second node of a second network, a cache update message. The second network may be the same as or different from the network. The cache update message may indicate one of: an arrival of the first flow at the second node; a departure of the first flow from the second node; a transmission impairment event of the first flow; and an updated state information of a first group of flows of the one or more groups of flows. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward a destination address. Managing, by the node, a local cache based on the lifecycle event may include updating, by the node, the local cache based on the cache update message.
Each of the request message and the response message may include a packet indicating one or more of: a source address, a destination address, a source port, a destination port, a length of the packet, a type of message, the one or more metrics of the first group of flows associated with the destination address, and the destination address. Each of the request message and the response message may be communicated via a protocol for sharing the state information of the one or more groups of flows among the plurality of nodes of the network.
The cache update message may include a packet indicating one or more of: a source address, a destination address, a source port, a destination port, a length of the packet, a type of message, one or more metrics of the first group of flows associated with the destination address, and the destination address. The cache update message may be communicated via a protocol for sharing the state information of the one or more groups of flows among the plurality of nodes of the network.
The state information for each group of flows of the one or more groups of flows may further include a last update time indicating a last time said state information was updated. The one or more metrics may indicate one or more of: a congestion window (CWND), a round-trip time (RTT), a flow completion time (FCT), a number of active flows, a number of early congestion notifications (ECN), and a number of packet drops. At least one of the one or more metrics may be an aggregated value of the one or more flows of said each group of flows.
According to another aspect, an apparatus is provided. The apparatus includes modules configured to perform one or more of the methods and systems described herein.
According to one aspect, an apparatus is provided, where the apparatus includes: a memory, configured to store a program; a processor, configured to execute the program stored in the memory, and when the program stored in the memory is executed, the processor is configured to perform one or more of the methods and systems described herein.
According to another aspect, a computer readable medium is provided, where the computer readable medium stores program code executed by a device and the program code is used to perform one or more of the methods and systems described herein.
According to one aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads, by using the data interface, an instruction stored in a memory, to perform one or more of the methods and systems described herein.
Other aspects of the disclosure provide for apparatus and systems configured to implement the methods according to the first aspect disclosed herein. For example, wireless stations and access points can be configured with machine readable memory containing instructions, which when executed by the processors of these devices, configure the devices to perform one or more of the methods and systems described herein.
Embodiments have been described above in conjunction with aspects of the present invention upon which they can be implemented. Those skilled in the art will appreciate that embodiments may be implemented in conjunction with the aspect with which they are described, but may also be implemented with other embodiments of that aspect. Where embodiments are mutually exclusive or incompatible with each other, this will be apparent to those skilled in the art. Some embodiments may be described in relation to one aspect, but may also be applicable to other aspects, as will be apparent to those of skill in the art.
BRIEF DESCRIPTION OF THE DRAWINGS

Further features and advantages of the present invention will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
FIG. 1 illustrates an implementation of TCS, according to an embodiment.
FIG. 2 illustrates the use of TCS to manage data center traffic, according to an embodiment.
FIG. 3 illustrates TRP agents at different nodes of a network, according to an embodiment.
FIG. 4 illustrates a TRP agent, according to an embodiment.
FIG. 5 illustrates a TRP packet, according to an embodiment.
FIG. 6 illustrates a method of managing a flow, according to an embodiment.
FIG. 7 illustrates another method of managing a flow, according to an embodiment.
FIG. 8 illustrates another method of managing a flow, according to an embodiment.
FIG. 9 illustrates an apparatus that may perform any or all of operations of the above methods and features explicitly or implicitly described herein, according to different aspects of the present disclosure.
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
DETAILED DESCRIPTION

Apparatus, methods and systems for a TCS may be provided according to one or more aspects. According to an aspect, a method 800 for managing flows may be provided. The method 800 may include detecting 801, by a node of a plurality of nodes of a network, a lifecycle event associated with a first flow. The lifecycle event may indicate one of: an arrival of the first flow; a departure of the first flow; a transmission impairment event of the first flow; and an updated state information of a first group of flows related to the first flow. The method 800 may further include managing 802, by the node, a local cache 402 based on the lifecycle event. The local cache may be configured to store state information of one or more groups of flows including the first group of flows. Each group of the one or more groups of flows may include one or more flows associated based on at least one of: sharing a common path in the network and being directed toward a same destination address. The state information for each said group may include one or more metrics based on transport layer information of said group of flows.
According to an aspect, a service which may be referred to as a Transport Coordination Service (TCS) may be provided to address one or more limitations described herein. An objective of TCS is to aggregate the transport layer information (TI) and bring it close to where the decisions are made in a proactive manner. The required pieces of information to make accurate and timely transport decisions may be spread all over the network (not only at the end-host) . However, currently there is no explicit mechanism to share this information among different flows and devices. TCS aims to extend this information sharing not only among end-hosts but also to the switches.
For the congestion control problem, as an example, congestion may happen at the source, in the network, or at the destination. Congestion signals that individual flows observe may include partial and delayed information. TCS may provide a mechanism for flows to have a collective view of the congestion close to flow sources where the flows react to congestion. Sharing this collective view may enable proactive in-network congestion management or adaptive load balancing mechanisms among many other possibilities.
Some works in the literature have proposed information aggregation at the transport layer. However, TCS has fundamental differences. First, traditional mechanisms aggregate connection states of multiple flows locally. However, TCS extends this to share information not only locally at the end-hosts, but also globally among end-hosts and switches. Second, many of these works aggregate information and enforce different decisions (such as congestion window adjustment) on the aggregate flow groups. However, TCS may only expose this information to the underlying transport and leave the control to the transport algorithms. Third, unlike existing mechanisms, TCS aims to build a service that is available to any type of transport (TCP, remote direct memory access (RDMA), etc.).
According to an aspect, TCS may be implemented as a distributed service with a presence in one or more of: the end-host stack, the network interface card (NIC), and switches, as shown in FIG. 1 and FIG. 3. FIG. 1 illustrates an implementation of TCS, according to an embodiment. As described herein, TCS may be implemented, in part, via a custom protocol, which may be referred to as the transport or transmission cooRdination protocol (TRP). TRP may be used by TRP agent(s) 100 to gather and share information among flows locally and globally. At each node (e.g., end-host or switch) of the network, a TRP agent 100 may communicate with one or more of: the application 102, TCP 104, IP 104 and NIC 106, as illustrated.
FIG. 2 illustrates the use of TCS to manage data center traffic, according to an embodiment. TCS may be used to manage traffic within a data center (DC) network (DCN). For example, TCS may be used to manage a flow between S1 of DC-A and R1 of DC-A. TCS may further be used to manage traffic between data centers. For example, TCS may be used to manage a flow between S2 of DC-A and R2 of DC-B.
According to an aspect, TCS may allow for managing flows (or packets) in the network as described herein. TCS can also be used to change congestion control behavior or load balancing on the fly. For example, TCS can be used to change the congestion window of an ongoing flow or change packet priorities on the fly, for load balancing or flow rerouting.
According to an aspect, TCS may run as a coordination service to guide the end-host transport. At a high level, TCS may include a (custom) communication protocol, which may be called TRP, to gather and share information among flows locally and globally. TCS may further include an aggregator to aggregate and analyze the collected information. TCS may further include an interface (e.g., a TCS interface) to maintain, expose, and share network state information and suggest actions (control laws) locally to the underlying transport technology.
As described herein, a component of TCS is the TRP. TRP may be present at end-hosts and network switches through TRP agents. FIG. 3 illustrates TRP agents at different nodes of a network, according to an embodiment. At each node of the network, a TRP agent may be present to provide TCS. In FIG. 3, TRP agents 311 and 312 are present at hosts 301 and 302, respectively, and TRP agents 313 and 314 are present at top-of-rack (ToR) switches 303 and 304, respectively.
TRP agents may collect information from end-hosts and other neighboring agents using a custom protocol. Each TRP agent 311, 312, 313, 314 may send queries and updates to, and receive responses from, the other TRP agents. For example, TRP agent 311, at host 301, may send queries and updates to and receive responses from TRP agent 313 at ToR switch 303, TRP agent 312 at host 302 (and TRP agent 314 at ToR switch 304, not shown in FIG. 3). Similarly, TRP agent 313 may send queries and updates to and receive responses from TRP agents 311, 312, and 314.
Accordingly, TRP agents may communicate with other TRP agents at one or more different levels of the hierarchical network architecture. According to an aspect, a TRP agent may communicate with another TRP agent in a same or different data center. One use case of TRP may be the cloud network scenario. For example, a TRP agent on a home Wi-Fi router may communicate with a TRP agent at an internet service provider gateway to share congestion information. Another example may be a TRP agent at a user equipment (e.g., a cellphone) interacting with a base station (such as the base station of the internet service provider) to obtain a more accurate congestion estimate.
According to an aspect, a TRP agent may maintain a local cache for storing hierarchically aggregated information, i.e., TRP state. This aggregated information may be readily available for any transport protocol.
According to an aspect, a TRP agent, implemented on a host (a host TRP agent), may interact with the TCP/IP, smart NIC/RDMA, and the application layer at the end-host. A TRP agent on a host may be connected to TRP agents located on the other hosts as well as at the switches.
According to an aspect, a TRP agent at a switch (a switch TRP agent) may be implemented in the data plane and communicate with host TRP agents and other neighboring switch TRP agents.
According to an aspect, a TRP state for a group of flows may be defined or described as a tuple: (S1, S2, …, SM, T), where S1, S2, …, SM may represent different metrics of interest based on transport information, and T may be the last time the state has been updated. In some embodiments, each metric may be an output of an aggregation function (e.g., average, summation, minimum, variance) over the individual flows (e.g., based on data from individual flows). Examples of metrics may include one or more of: average congestion window or average congestion control window (CWND), average Round-Trip Time (RTT), maximum Flow Completion Time (FCT), summation of the number of active flows, and summation of the number of Early Congestion Notifications (ECN) or packet drops. The update time, T, may be used to evaluate the information freshness. The TRP state can be the aggregate of one or both of: the local flow information and global information.
FIG. 4 illustrates a TRP agent, according to an embodiment. TRP agent 400 may be similar to any of the TRP agents 100, 311, 312, 313 and 314. TRP agent 400 may comprise a cache 402 for storing a key-value store table as illustrated. The key in this table can be a group id defined based on the transport layer tuple or some application layer level metric, and the value may be the TRP state of the flows belonging to this group. For example, the key can be a destination IP of the associated flows and the value may be the TRP state of the flows going to the destination IP.
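As an illustrative, non-limiting sketch, the cache 402 may be modeled as a key-value table mapping a group identifier (e.g., a destination IP) to a TRP state tuple (S1, …, SM, T). The class and field names below are hypothetical stand-ins for the metrics listed above and do not define a required structure.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TRPState:
    """Aggregated transport-layer state for one group of flows.

    The metric names are illustrative stand-ins for S1..SM; last_update is T."""
    avg_cwnd: float = 0.0          # average congestion window (e.g., bytes)
    avg_rtt_us: float = 0.0        # average round-trip time (microseconds)
    max_fct_us: float = 0.0        # maximum flow completion time
    active_flows: int = 0          # number of active flows in the group
    ecn_or_drops: int = 0          # ECN marks / packet drops observed
    last_update: float = field(default_factory=time.time)  # T, last update time


class TRPCache:
    """Local key-value cache 402: group id (e.g., destination IP) -> TRP state."""

    def __init__(self) -> None:
        self._table: Dict[str, TRPState] = {}

    def read(self, group_id: str) -> Optional[TRPState]:
        return self._table.get(group_id)      # None models a cache miss

    def write(self, group_id: str, state: TRPState) -> None:
        state.last_update = time.time()       # refresh T on every write
        self._table[group_id] = state
```

For example, `TRPCache().write("10.0.0.7", TRPState(avg_rtt_us=120.0, active_flows=3))` would associate one aggregated state with all flows destined to the address 10.0.0.7.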
TRP agent 400 may comprise one or more components as illustrated. Different components of the TRP agent 400 and their interactions are shown in FIG. 4. One responsibility of the TRP agent is to respond to queries about the transport state. These queries 404 may be received from a host or a switch at which the TRP agent is implemented. Queries 404 may also be received from the other TRP agents at one or more other nodes of the network. The TRP agent may comprise a responder 408, which receives the queries 404 and responds to them by sending a response message 406.
When a query 404 is received, the TRP agent may look up or read 410 the requested key in its local TRP cache 402. If the key exists in the cache (e.g., cache hit 412), the value will be forwarded to an assessor component 414. The assessor 414 may calculate a score for the retrieved value based on one or both of two properties: the freshness and the performance of the entry. Freshness may be estimated based on the stored value of T, which indicates the last update time. Performance may be determined or calculated based on the transport metrics of the entry.
For a congestion control example, the assessor 414 may calculate the ratio of the base RTT to the observed average RTT, which gives a value between 0 and 1, where higher values indicate a less congested state. Based on the score generated by the assessor and a randomly generated probability, the TRP agent may decide whether to respond with the retrieved state from its local cache or to query a neighbor TRP agent. For higher scores 416, i.e., when the entry is fresh and the metric performance is good or adequate, the probability of responding with the local state is higher. If the local cache does not contain the key (cache miss 418) or the calculated score is low 420, an enquirer component 422 may generate a query message 424 and wait for its response 426. The query message may request, from one or more nodes of the network (e.g., a neighbor host node or a switch), the state value (e.g., transport layer information) related to the key. A node receiving query 424 may reply with a response message 426 including the requested state value. When the response message is received, enquirer 422 may update the cache 402 by writing 428 the received state value therein. Enquirer 422 may further send the received state value in a response 430 to the responder 408, which then responds 406 to the query with the state value.
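A minimal sketch of this responder/assessor decision, building on the TRPState/TRPCache sketch above, is given below. The equal weighting of freshness and performance, the one-second freshness window, and the use of the base-RTT ratio as the performance term are assumptions chosen only to illustrate the described behavior.

```python
import random
import time


def assess(state: TRPState, base_rtt_us: float, freshness_window_s: float = 1.0) -> float:
    """Score a cached entry on freshness and performance (both mapped to [0, 1])."""
    age = time.time() - state.last_update
    freshness = max(0.0, 1.0 - age / freshness_window_s)
    # Congestion-control example: base RTT / observed average RTT,
    # where values closer to 1 indicate a less congested state.
    performance = base_rtt_us / state.avg_rtt_us if state.avg_rtt_us > 0 else 0.0
    return 0.5 * freshness + 0.5 * min(1.0, performance)   # assumed equal weighting


def answer_query(cache: TRPCache, group_id: str, base_rtt_us: float):
    """Respond from the local cache or fall back to querying a neighbor agent."""
    state = cache.read(group_id)
    if state is None:                        # cache miss 418 -> enquirer path
        return ("query_neighbor", None)
    score = assess(state, base_rtt_us)
    if random.random() < score:              # high score 416 -> respond locally
        return ("respond_local", state)
    return ("query_neighbor", state)         # low score 420 -> refresh via enquirer 422
```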
According to an aspect, when a major change happens in the TRP state, the cache 402 may be updated with a new TRP state based on the change. According to an aspect, an advertiser component 432 may read 434 the new TRP state and update 436 the neighboring TRP agents about the new TRP state. Some examples of events that may notably or substantially change the TRP state may include: flow departures, flow arrivals and packet drops. Whenever the TRP agent receives a state update 438 from a node (e.g., a neighbor node) , a state aggregator 440 may be responsible for aggregating the local state with the received update (e.g., using a weighted moving average) and writing 442 it to the local cache.
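The aggregation step may, for example, blend a received update into the local entry with a weighted moving average, as in the sketch below; the smoothing factor and the per-metric aggregation choices (average, maximum, sum) are assumptions for illustration, reusing the TRPState/TRPCache sketch above.

```python
def aggregate_update(cache: TRPCache, group_id: str, update: TRPState,
                     alpha: float = 0.3) -> None:
    """Merge a received state update 438 into the local cache 402
    using a weighted moving average; alpha is an assumed smoothing factor."""
    local = cache.read(group_id)
    if local is None:
        cache.write(group_id, update)            # first report for this group
        return
    merged = TRPState(
        avg_cwnd=(1 - alpha) * local.avg_cwnd + alpha * update.avg_cwnd,
        avg_rtt_us=(1 - alpha) * local.avg_rtt_us + alpha * update.avg_rtt_us,
        max_fct_us=max(local.max_fct_us, update.max_fct_us),
        active_flows=update.active_flows,        # counts taken from the latest update
        ecn_or_drops=local.ecn_or_drops + update.ecn_or_drops,
    )
    cache.write(group_id, merged)                # write 442 the aggregated state
```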
According to an aspect, the state aggregator may aggregate the collected information and may employ different learning agents to generate insights for the transport. As may be appreciated, the exact analysis method can vary depending on the use case. Implementing such a mechanism in user space may provide the benefit of performing calculations (e.g., floating point) or running programs (e.g., ML analysis) that cannot trivially be run in kernel space. For example, by doing these calculations at the user level, a more involved control loop may be used (or the freedom to use a more involved control loop may be provided) that could possibly take more time than the allowed lifetime of a kernel-level application. This may create an opportunity for placing learning agents as decision makers.
FIG. 5 illustrates a TRP packet, according to an embodiment. The TRP packet 500 may be used for communicating TRP messages (e.g., a query message, an update message and a response message). The TRP packet 500 may include a first part comprising one or more fields indicating one or more of: a source address 502 of the TRP agent that is sending the message (e.g., SRC IP), a destination address 504 of the TRP agent that is the recipient of the message (destination IP), a source port (SRC Port) 506, a destination port (DST Port) 508, and a length or total length of the packet (Len) 510. The TRP packet 500 may include a second part comprising one or more fields indicating one or more of: a type of the TRP packet 512 (OP Code: 00 for Query, 01 for Update, and 10 for Response), a TRP cache key or a destination address whose information is asked or shared (Target IP 514), and a TRP state mapped to the Target IP (Metrics 516).
As mentioned, TRP packet 500 may indicate the type of TRP message. The TRP may support a TRP query packet type (corresponding to a TRP query message) , a TRP update packet type (corresponding to a TRP update message) , and a TRP response packet type (corresponding to a TRP response message) .
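As a non-normative sketch, a TRP packet carrying the fields of FIG. 5 could be serialized as shown below. The field widths, byte order, and representation of the metrics as a variable-length list of floats are assumptions, not a defined wire format.

```python
import socket
import struct

OP_QUERY, OP_UPDATE, OP_RESPONSE = 0b00, 0b01, 0b10    # OP Code values from FIG. 5


def pack_trp(src_ip, dst_ip, src_port, dst_port, op_code, target_ip, metrics):
    """Serialize a TRP message with the field order of packet 500:
    SRC IP, DST IP, SRC Port, DST Port, Len, OP Code, Target IP, Metrics."""
    addressing = struct.pack("!4s4sHH",
                             socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
                             src_port, dst_port)
    payload = (struct.pack("!B", op_code)
               + socket.inet_aton(target_ip)
               + struct.pack("!%df" % len(metrics), *metrics))
    total_len = len(addressing) + 2 + len(payload)      # 2 bytes for the Len field itself
    return addressing + struct.pack("!H", total_len) + payload


# Example: a query asking about the group keyed by destination 10.0.1.5.
query = pack_trp("10.0.0.1", "10.0.0.2", 7000, 7000, OP_QUERY, "10.0.1.5", [])
```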
A TRP query message 424 may be used when a key (destination IP) is not available in the local TRP cache 402 of a TRP agent 400, when the key exists but the state is stale (i.e., the time elapsed since T is larger than a threshold), or when the state is scored as a “bad” performance state. In such cases, the TRP agent may need to query the information, and a TRP query message 424 may be sent to other TRP agents asking about the state of a particular key. One use case of the TRP query message may be when a new flow joins, or activates after a recovery or inactive period, and wants to promptly learn about the state of the network.
A TRP update message 436 may be sent when a TRP agent 400 wants to inform another agent (e.g., a neighboring agent) about an update in the state of a key. An update happens when a notable or reasonably substantial change in the state happens, for example, when a new flow joins, a flow departs, or a notable or reasonably substantial congestion (or availability) is perceived. In some embodiments, TRP agents may send update messages 436 to each other on a periodic basis such that the communication overhead is negligible (e.g., a 1 s interval).
A TRP response message 406 may be a response to a TRP query message 404. When a TRP agent 400 receives a TRP query message 404 from another node (e.g., a neighbor node) for a specific key, the TRP agent, via the responder 408, may respond with the state mapped to that key in its local cache.
FIG. 6 illustrates a method of managing a flow, according to an embodiment. The method 600 may include receiving, at a source node 630, a flow 602 (the flow arrives at the source node). The flow may be directed towards or have its destination as destination node 644. The source node 630 may include or have implemented on it a TRP agent 640, which may be similar to the TRP agent 400. The source node 630 may notify its TRP agent 640 of the arrival of the flow through some procedure call. To manage the flow, the source node 630 may, via its TRP agent 640, collect information about the network state associated with the flow to decide on how to manage the flow (e.g., determine at what time and at what rate to transmit the flow). Information about the network state associated with the flow may refer to state information of a group of flows related to or associated with the flow. The group of flows may include one or more flows that are related to the flow based on one or more of: sharing a common path and being directed to or toward a same destination (e.g., destination node 644).
TRP agent 640 may obtain the state information of the group of related flows, include the obtained state information in a TRP response message 612, and send it to the source node 630. In an embodiment, the TRP agent 640 may obtain the state information (based on transport layer information of the group of related flows) by checking and reading 606 its local cache. Accordingly, the source node 630 may, via its TRP agent 640, read the local cache at the TRP agent 640 to obtain the transport layer information.
In some embodiments, the state information in the local cache may be unreliable. For example, the local cache may be empty of (i.e., may not contain) the requested state information. In some embodiments, the state information in the local cache may be unreliable due to the requested state information (associated with the flow) being stale or old based on the last time the state information was updated, i.e., based on the associated time T. In some embodiments, the state information in the local cache may be unreliable due to the state information (e.g., one or more metrics based on transport layer information) indicating a value that is substantially different from an expected value.
If the state information in the local cache is unreliable for one or more reasons, the TRP agent 640 may obtain the requested state information from another TRP agent, a second TRP agent 642, at another node (e.g., a neighbor node or a remote node) of a second network. The other node may be a second node in the same network as that of the source node 630 or in a different network. The TRP agent 640 may send a TRP query message 608 to the second TRP agent 642 requesting the state information. The second TRP agent 642 may send a TRP response message 610 to the TRP agent 640, the TRP response message 610 comprising the requested state information. In some embodiments, the TRP agent 640 may send a TRP update message 607 to one or more other TRP agents (e.g., TRP agent 642) at one or more other nodes of the network notifying the one or more other TRP agents of the arrival of the flow. In some embodiments, the TRP update message 607 and the TRP query message 608 may be sent in a same message.
Based on the received state information associated with the flow, the source node 630 may determine 614 a time and a rate for transmitting data of the flow. The source node 630 may then send the data 616 based on the determined time and rate.
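A hedged sketch of this flow-arrival handling in method 600, reusing the cache and query helpers sketched above, is given below. The helper names (on_flow_arrival, query_neighbor) and the mapping from the obtained state to an initial sending rate are hypothetical illustrations, not part of the disclosed protocol.

```python
def on_flow_arrival(cache: TRPCache, dst_ip: str, base_rtt_us: float,
                    query_neighbor, link_capacity_bps: float) -> float:
    """Handle the flow-arrival lifecycle event 602 at the source node 630.

    query_neighbor(dst_ip) stands in for the TRP query/response exchange
    (messages 608 / 610) with a second TRP agent 642 and returns a TRPState."""
    action, state = answer_query(cache, dst_ip, base_rtt_us)
    if action == "query_neighbor" or state is None:
        state = query_neighbor(dst_ip)       # TRP query 608 and response 610
        cache.write(dst_ip, state)           # keep the local cache warm
    # Determine 614 an initial sending rate from the obtained state, e.g. by
    # splitting capacity among the flows already active toward this destination.
    return link_capacity_bps / max(1, state.active_flows + 1)   # rate for send 616
```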
FIG. 7 illustrates another method of managing a flow, according to an embodiment. The method 700 may relate to managing a flow, at the source node 630, when a transmission impairment event associated with a flow occurs. In some embodiments, method 700 may operate independently of or in combination with method 600. A transmission impairment event may be a congestion notification signal, a delay, a packet drop, or any other event that impairs the transmission of the flow.
In an embodiment, the source node 630 may be sending data 702 of a flow to a destination node 644. At some point during transmission, the flow may experience a transmission impairment event, for example, a packet drop 704. The source node 630 may then notify its TRP agent 640 of the transmission impairment event by sending a TRP update message 705. The source node 630 may further request that TRP agent 640 collect information about the network state associated with the flow. For example, the source node 630 may send a TRP query message 706 to TRP agent 640 requesting state information of a group of flows related to or associated with the flow. The group of flows may include one or more flows that are related to the flow based on one or more of: sharing a common path and being directed to or toward a same destination (e.g., destination node 644). In some embodiments, the TRP update message 705 and the TRP query message 706 may be sent in one message.
TRP agent 640 may obtain the requested state information and include the obtained state information in a TRP response message 714 and send it to the source node 630. In an embodiment, the TRP agent 640 may obtain the transport layer information by checking 708 its local cache.
In some embodiments, the state information in the local cache may be unreliable. For example, the local cache may be empty of (i.e., may not contain) the requested state information. In some embodiments, the state information in the local cache may be unreliable due to the requested state information (associated with the flow) being stale or old based on the last time the state information was updated, i.e., based on the associated time T. In some embodiments, the state information in the local cache may be unreliable due to the state information (e.g., one or more metrics based on transport layer information) indicating a value that is substantially different from an expected value.
If the state information in the local cache is unreliable for one or more reasons, the TRP agent 640 may obtain the requested state information from another TRP agent, a second TRP agent 642, at another node of the network (e.g., a neighbor node or a remote node). The TRP agent 640 may send a TRP query message 710 to the second TRP agent 642 requesting the state information. The second TRP agent 642 may send a TRP response message 712 to the TRP agent 640, the TRP response message 712 comprising the requested state information. In some embodiments, the TRP agent 640 may send a TRP update message 709 to one or more other TRP agents (e.g., TRP agent 642) at one or more other nodes of the network notifying the one or more other TRP agents of the transmission impairment event associated with the flow. In some embodiments, the TRP update message 709 and the TRP query message 710 may be sent in a same message.
Based on the received state information associated with the flow, the source node 630 may determine 716 a new or second transmission rate that is more in line with the state of the network according to the received state information. The source node 630 may then send the data 718 based on the determined transmission rate. Accordingly, the source node 630 may adjust its transmission rate to an adaptive or dynamic rate based on the latest state of the network. This allows the node to respond to changes in network congestion, packet loss, and other factors, ensuring that it sends data at a rate that is suitable or adequate for the latest state of the network.
At some point during the transmission of the flow, the flow may depart 720 from the source node 630. In an embodiment, after the flow departs from the source node 630, the source node 630 may notify its TRP agent 640 of the departure of the flow by sending a TRP update message 722. The TRP agent 640 may then send a TRP update message 724 to one or more other TRP agents (e.g., TRP agent 642) at one or more other nodes of the network notifying the one or more other TRP agents of the departure of the flow from the source node.
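The impairment and departure handling of method 700 may be sketched in the same style, reusing the structures above. Here notify_neighbors stands in for sending TRP update messages (e.g., 709 and 724), and the rate backoff rule is an assumed example, not a prescribed control law.

```python
def on_transmission_impairment(cache: TRPCache, dst_ip: str, notify_neighbors,
                               current_rate_bps: float, base_rtt_us: float) -> float:
    """Handle a transmission impairment event 704 (e.g., a packet drop)."""
    state = cache.read(dst_ip) or TRPState()
    state.ecn_or_drops += 1
    cache.write(dst_ip, state)                       # local update (TRP update 705)
    notify_neighbors(dst_ip, state)                  # TRP update message 709
    # Determine 716 a second transmission rate from the state: back off more
    # when the observed RTT is far above the base RTT (assumed control law).
    ratio = min(1.0, base_rtt_us / state.avg_rtt_us) if state.avg_rtt_us > 0 else 1.0
    return current_rate_bps * max(0.5, ratio)


def on_flow_departure(cache: TRPCache, dst_ip: str, notify_neighbors) -> None:
    """Handle the flow departure lifecycle event 720."""
    state = cache.read(dst_ip) or TRPState()
    state.active_flows = max(0, state.active_flows - 1)
    cache.write(dst_ip, state)                       # local update (TRP update 722)
    notify_neighbors(dst_ip, state)                  # TRP update message 724
```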
FIG. 6 and FIG. 7 and their corresponding methods 600 and 700 may together illustrate a lifecycle of a flow and the lifecycle events that may occur during the lifecycle: arrival of the flow, a transmission impairment event, and departure of the flow.
According to an aspect, the TRP agent 400 may comprise one or more interfaces for communicating with: the node at which the TRP agent is implemented, and one or more other TRP agents at one or more other nodes of the network. The one or more interfaces may include a TRP interface that interacts with a local transport at end hosts (e.g., where the TRP agent is implemented at end hosts). The TRP interface may be a generic interface that is compatible with different transport technologies such as TCP or RDMA. In an embodiment, the TRP interface may be an interface based on the extended Berkeley Packet Filter (eBPF).
As may be appreciated, one feature of the eBPF is the hook points that allow a Berkeley Packet Filter (BPF) program to collect and share information with the transport. Another feature of the eBPF is its BPF filesystem and the maps that it provides for state management. While each application can have its own local memory section to operate on, these maps can also be shared between different applications. Further, the user-level applications can modify these maps as well. Therefore, TCS may leverage eBPF to allow for observing a packet at different stages of its data path and share custom information between these stages and other flows. TCS may further leverage eBPF to provide the capability to be extended to NICs and Data Plane Development Kit (DPDK) , which allows compatibility across different transport technologies.
As described herein, according to an aspect, transport layer information (TI) of a plurality of nodes (e.g., hosts, neighbors, switches) of a network may be aggregated to allow for improved network visibility and improved and timely transport decisions, such as changing a congestion window or a data transmission rate.
According to an aspect, an out-of-band TRP message may be used as a way to collect and share TI. The TRP may allow for simplifying congestion control (CC) design and be applicable to high-speed networks. According to an aspect, congestion sensing may be separated from congestion control, which may reduce overhead related to gathering congestion information and may enable proactive transport design. According to an aspect, the need for a slow start phase in TCP design may be obviated, which may enable fast convergence.
FIG. 8 illustrates another method of managing a flow, according to an embodiment. The method 800 may be a method for managing flows. The method 800 may include detecting 801, by a node of a plurality of nodes of a network, a lifecycle event associated with a first flow. The method 800 may further include managing 802, by the node, a local cache 402 based on the lifecycle event. The local cache may be configured to store state information of one or more groups of flows. Each group of the one or more groups of flows may include one or more flows associated based on at least one of: sharing a common path in the network and being directed toward a same destination address. The state information for each said group may include one or more metrics based on transport layer information of said group of flows.
The lifecycle event may indicate an arrival of the first flow 602 at the node. The first flow may be destined to a first destination address. Managing 802, by the node, the local cache based on the lifecycle event may include obtaining, by the node, state information of a first group of flows of the one or more groups of flows. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward the first destination address. The method may further include sending, by the node, data of the first flow at a rate determined based on the obtained state information.
Obtaining, by the node, state information of the first group of flows may include retrieving 606, by the node from the local cache, the state information of the first group of flows.
Obtaining, by the node, state information of the first group of flows may include sending, by the node to a second node of a second network, a request message 608 requesting the state information of the first group of flows. The second network may be the same as or different from the network. Obtaining, by the node, state information of the first group of flows may further include receiving, by the node from the second node, a response message 610 including the state information of the first group of flows.
The method may further include sending, by the node to a second node of a second network, a request message requesting the state information of the first group of flows. The second network may be the same as or different from the network. The method may further include receiving, by the node from the second node, a response message including the state information of the first group of flows. The method may further include updating, by the node, the local cache based on the received state information of the first group of flows.
The method may further include determining, by the node, that the state information of the first group of flows in the local cache is unreliable based on one or more of: the local cache being empty of the state information of the first group of flows; the state information of the first group of flows at the local cache being stale; and the state information of the first group of flows indicating a value that is substantially different from an expected value.
The lifecycle event may indicate a departure 720 of the first flow from the node. Managing, by the node, the local cache based on the lifecycle event may include updating, by the node, the local cache based on the lifecycle event. Managing, by the node, the local cache based on the lifecycle event may further include sending, by the node to a second node of a second network, a cache update message 724 indicating that the first flow has departed from the node. The second network may be the same as or different from the network.
The lifecycle event may indicate a transmission impairment 704 of the first flow. Managing, by the node, the local cache based on the lifecycle event may include sending, by the node to a second node of a second network, a cache update message 709 indicating the transmission impairment event, wherein the second network is the same as or different from the network. The transmission impairment may be one or more of: a congestion notification signal, a delay, and a packet drop.
The lifecycle event may be a passage of time since the state information of a first group of flows of the one or more groups of flows was last updated. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward a destination address. Managing, by the node, a local cache based on the lifecycle event may include sending, by the node to a second node of a second network, a request message for updated state information of the first group of flows. The second network may be the same as or different from the network. Managing, by the node, a local cache based on the lifecycle event may include receiving, by the node from the second node, a response message including the updated state information of the first group of flows.
The lifecycle event may be a passage of time since a last cache update message was sent to a second node of a second network. The last cache update message may have been sent at a last update time and include the state information of a first group of flows of the one or more groups of flows in the local cache at the last update time. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward a destination address, wherein the second network is the same as or different from the network. Managing, by the node, a local cache based on the lifecycle event may include sending, by the node to the second node at a second time after the last update time, an updated cache update message including state information of the first group of flows in the local cache at the second time.
Detecting, by the node of the network, the lifecycle event associated with the first flow may include receiving, by the node from a second node of a second network, a request message for state information of a first group of flows of the one or more groups. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward a destination address. The second network may be the same as or different from the network. Managing, by the node, a local cache based on the lifecycle event may include retrieving, by the node from the local cache, the state information of the first group of flows. The method may further include sending, by the node to the second node, a response message including the retrieved state information of the first group of flows.
Detecting, by the node of the network, the lifecycle event associated with the first flow may include receiving, by the node from a second node of a second network, a cache update message. The second network may be the same as or different from the network. The cache update message may indicate one of: an arrival of the first flow at the second node; a departure of the first flow from the second node; a transmission impairment event of the first flow; and an updated state information of a first group of flows of the one or more groups of flows. The first group of flows may be associated with the first flow based on at least one of: sharing a common path and being directed toward a destination address. Managing, by the node, a local cache based on the lifecycle event may include updating, by the node, the local cache based on the cache update message.
Each of the request message and the response message may include a packet indicating one or more of: a source address, a destination address, a source port, a destination port, a length of the packet, a type of message, the one or more metrics of the first group of flows associated with the destination address, and the destination address. Each of the request message and the response message may be communicated via a protocol for sharing the state information of the one or more groups of flows among the plurality of nodes of the network.
The cache update message may include a packet 500 indicating one or more of: a source address, a destination address, a source port, a destination port, a length of the packet, a type of message, one or more metrics of the first group of flows associated with the destination address, and the destination address. The cache update message may be communicated via a protocol for sharing the state information of the one or more groups of flows among the plurality of nodes of the network.
The state information for each group of flows of the one or more groups of flows may further include a last update time indicating a last time said state information was updated. The one or more metrics may indicate one or more of: a congestion window (CWND), a round-trip time (RTT), a flow completion time (FCT), a number of active flows, a number of early congestion notifications (ECN), and a number of packet drops. At least one of the one or more metrics may be an aggregated value of the one or more flows of said each group of flows.
FIG. 9 illustrates an apparatus 900 that may perform any or all of the operations of the above methods and features explicitly or implicitly described herein, according to different aspects of the present disclosure. For example, a computer equipped with a network function may be configured as the apparatus 900. In some aspects, apparatus 900 can be a device that connects to the network infrastructure over a radio interface, such as a mobile phone, smart phone or other such device that may be classified as user equipment (UE). In some aspects, the apparatus 900 may be a Machine Type Communications (MTC) device (also referred to as a machine-to-machine (m2m) device), or another such device that may be categorized as a UE despite not providing a direct service to a user. In some aspects, apparatus 900 may be configured or used to implement one or more aspects described herein. For example, apparatus 900 may be a network node involved in one or more operations or methods described herein, a host, an end host, a switch, a router, the source node 630, or the TRP agent 640 or 642.
As shown, the apparatus 900 may include a processor 910, such as a Central Processing Unit (CPU) or specialized processors such as a Graphics Processing Unit (GPU) or other such processor unit, memory 920, non-transitory mass storage 930, input-output interface 940, network interface 950, and a transceiver 960, all of which are communicatively coupled via bi-directional bus 970. Transceiver 960 may include one or multiple antennas. According to certain aspects, any or all of the depicted elements may be utilized, or only a subset of the elements. Further, apparatus 900 may contain multiple instances of certain elements, such as multiple processors, memories, or transceivers. Also, elements of the hardware device may be directly coupled to other elements without the bi-directional bus. Additionally, or alternatively to a processor and memory, other electronics or processing electronics, such as integrated circuits, application specific integrated circuits, field programmable gate arrays, digital circuitry, analog circuitry, chips, dies, multichip modules, substrates or the like, or a combination thereof, may be employed for performing the required logical operations.
The memory 920 may include any type of non-transitory memory such as static random-access memory (SRAM) , dynamic random-access memory (DRAM) , synchronous DRAM (SDRAM) , read-only memory (ROM) , any combination of such, or the like. The mass storage element 930 may include any type of non-transitory storage device, such as a solid-state drive, hard disk drive, a magnetic disk drive, an optical disk drive, USB drive, or any computer program product configured to store data and machine executable program code. According to certain aspects, the memory 920 or mass storage 930 may have recorded thereon statements and instructions executable by the processor 910 for performing any of the aforementioned method operations described above.
Aspects of the present disclosure can be implemented using electronics hardware, software, or a combination thereof. In some aspects, this may be implemented by one or multiple computer processors executing program instructions stored in memory. In some aspects, the invention is implemented partially or fully in hardware, for example using one or more field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs) to rapidly perform processing operations.
It will be appreciated that, although specific aspects of the technology have been described herein for purposes of illustration, various modifications may be made without departing from the scope of the technology. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present invention. In particular, it is within the scope of the technology to provide a computer program product or program element, or a program storage or memory device such as a magnetic or optical wire, tape or disc, or the like, for storing signals readable by a machine, for controlling the operation of a computer according to the method of the technology and/or to structure some or all of its components in accordance with the system of the technology.
Acts associated with the method described herein can be implemented as coded instructions in a computer program product. In other words, the computer program product is a computer-readable medium upon which software code is recorded to execute the method when the computer program product is loaded into memory and executed on the microprocessor of the wireless communication device.
Further, each operation of the method may be executed on any computing device, such as a personal computer, server, PDA, or the like and pursuant to one or more, or a part of one or more, program elements, modules or objects generated from any programming language, such as C++, Java, or the like. In addition, each operation, or a file or object or the like implementing each said operation, may be executed by special purpose hardware or a circuit module designed for that purpose.
Through the descriptions of the preceding aspects, the present invention may be implemented by using hardware only or by using software and a necessary universal hardware platform. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which can be a compact disc read-only memory (CD-ROM) , USB flash disk, or a removable hard disk. The software product includes a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the aspects of the present invention. For example, such an execution may correspond to a simulation of the logical operations as described herein. The software product may additionally or alternatively include a number of instructions that enable a computer device to execute operations for configuring or programming a digital logic apparatus in accordance with aspects of the present invention.
Although the present invention has been described with reference to specific features and aspects thereof, it is evident that various modifications and combinations can be made thereto without departing from the invention. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present invention.