CLAIM OF BENEFIT TO PRIOR APPLICATIONS
This application is a continuation application of U.S. patent application Ser. No. 13/757,594, filed on Feb. 1, 2013 and now published as U.S. Patent Publication 2013/0148656. U.S. patent application Ser. No. 13/757,594 is a continuation application of U.S. patent application Ser. No. 13/589,062, filed on Aug. 17, 2012 and now issued as U.S. Pat. No. 9,369,426. U.S. patent application Ser. No. 13/589,062 claims the benefit of U.S. Provisional Patent Application 61/524,754, filed Aug. 17, 2011; U.S. Provisional Patent Application 61/643,339, filed May 6, 2012; U.S. Provisional Patent Application 61/654,121, filed Jun. 1, 2012; and U.S. Provisional Patent Application 61/666,876, filed Jul. 1, 2012. U.S. patent application Ser. No. 13/757,594 claims the benefit of U.S. Provisional Patent Application 61/643,339, filed May 6, 2012; U.S. Provisional Patent Application 61/654,121, filed Jun. 1, 2012; and U.S. Provisional Patent Application 61/666,876, filed Jul. 1, 2012. U.S. patent application Ser. No. 13/757,594, now published as U.S. Patent Publication 2013/0148656, and Ser. No. 13/589,062, now issued as U.S. Pat. No. 9,369,426, and U.S. Provisional Patent Applications 61/524,754, 61/643,339, 61/654,121, and 61/666,876, are incorporated herein by reference.
BACKGROUND
Many current enterprises have large and sophisticated networks comprising switches, hubs, routers, servers, workstations and other networked devices, which support a variety of connections, applications and systems. The increased sophistication of computer networking, including virtual machine migration, dynamic workloads, multi-tenancy, and customer-specific quality of service and security configurations, requires a better paradigm for network control. Networks have traditionally been managed through low-level configuration of individual components. Network configurations often depend on the underlying network: for example, blocking a user's access with an access control list (“ACL”) entry requires knowing the user's current IP address. More complicated tasks require more extensive network knowledge: forcing guest users' port 80 traffic to traverse an HTTP proxy requires knowing the current network topology and the location of each guest. This process is of increased difficulty where the network switching elements are shared across multiple users.
In response, there is a growing movement towards a new network control paradigm called Software-Defined Networking (SDN). In the SDN paradigm, a network controller, running on one or more servers in a network, controls, maintains, and implements control logic that governs the forwarding behavior of shared network switching elements on a per user basis. Making network management decisions often requires knowledge of the network state. To facilitate management decision-making, the network controller creates and maintains a view of the network state and provides an application programming interface upon which management applications may access a view of the network state.
Some of the primary goals of maintaining large networks (including both datacenters and enterprise networks) are scalability, mobility, and multi-tenancy. Many approaches taken to address one of these goals results in hampering at least one of the others. For instance, one can easily provide network mobility for virtual machines within an L2 domain, but L2 domains cannot scale to large sizes. Furthermore, retaining user isolation greatly complicates mobility. As such, improved solutions that can satisfy the scalability, mobility, and multi-tenancy goals are needed.
BRIEF SUMMARY
Some embodiments in some cases model logical routing as an act of interconnecting two or more logical datapath (LDP) sets operating in L2 domains by a logical router that implements a logical datapath set (LDPS) operating in an L3 domain. A packet traversing from a logical L2 domain to another will take the following four steps in some embodiments. These four steps are described below in terms of the logical processing operations that the network control system implements. However, it is to be understood that these operations are performed by the managed switching elements of the network based on the physical control plane data that is produced by the network control system.
First, the packet will be processed through an L2 table pipeline of the originating logical L2 domain. The pipeline will conclude with the destination media access control (MAC) address being forwarded to a logical port attached to a logical port of a logical router.
Second, the packet will be processed though a logical router's L3 datapath, again by sending it through this router's L3 table pipeline. The L2 lookup stage common in physical routers is skipped in the router's L3 datapath in some embodiments, as the logical router will only receive packets requiring routing.
In some embodiments, the L3 forwarding decision will use the prefix forwarding information base (FIB) entries that are provisioned by the logical control plane of the logical router. In some embodiments, a control application is used to receive the logical control plane data, and to convert this data to logical forwarding plane data that is then supplied to the network control system. For the L3 forwarding decision, some embodiments use the prefix FIB entries to implement longest prefix matching.
As a result, the L3 router will forward the packet to the logical port that is “connected” to the destination L2 LDPS. Before forwarding the packet further to that LDPS, the L3 router will change the originating MAC address to one that is defined in its domain as well as resolve the destination IP address to a destination MAC address. The resolution is executed by the last “IP output” stage of the L3 data pipeline in some embodiments. The same pipeline will decrement TTL and update the checksum (and respond with ICMP if TTL goes to zero).
It should be noted that some embodiments rewrite the MAC address before feeding the processed packet to the next LDPS, because without this rewriting a different forwarding decision could result at the next LDPS. It should also be noted that even though traditional routers execute the resolution of the destination IP address using ARP, some embodiments do not employ ARP for this purpose in the L3 logical router because as long as the next-hop is a logical L2 datapath, this resolution remains internal to the virtualization application.
Third, the packet will be processed through an L2 table pipeline of the destination logical L2 domain. The destination L2 table pipeline determines the logical egress port along which it should send the packet. In case of an unknown MAC address, this pipeline would resolve the MAC address location by relying on some distributed lookup mechanism. In some embodiments, the managed switching elements rely on a MAC learning algorithm, e.g., they flood the unknown packets. In these or other embodiments, the MAC address location information can also be obtained by other mechanisms, for instance out-of-band. If such a mechanism is available in some embodiments, the last logical L2 table pipeline uses this mechanism to obtain the MAC address location.
Fourth, the packet gets sent to the logical port attached to the physical port representing the logical port attachment. At this stage, if the port is point-to-point media (e.g., virtual network interface, VIF), there's nothing left to do but to send the packet to the port. However, if the last LDPS was an L3 router and hence the attachment is a physical L3 subnet, the attachment point, in some embodiments, resolves the destination IP address by using ARP before sending the packet out. In that case, the source MAC address would be egress specific and not the logical MAC interface address in case of a VIF. In other embodiments, resolving the destination IP address by using ARP is performed during the second step by the L3 logical router.
In the example above, there's only a single logical router interconnecting logical L2 datapaths, but nothing limits the topologies. One of ordinary skill in the art will recognize that more LDP sets can be interconnected for richer topologies.
In some embodiments, the control application allows an L3 specific logical state to be defined in terms of one or more tables that specify a logical L3 pipeline. The corresponding logical control plane managing the LDPS pipeline can either rely on static route configuration, or peer with other LDP sets over a standard routing protocol.
In some embodiments, the virtualization application defines the physical realization of the above-described, four-step L2/L3 packet processing into physical control plane data, which when translated into physical forwarding data by the managed switching elements, effectuates a sequence of logical pipeline executions that are all or predominantly performed at the first-hop, managed edge switching element. In order to maintain the locality of the physical traffic, the first-hop executes the series of pipelines (with all state required) and directly sends the traffic towards the ultimate egress location in the physical network. When short cut tunnels are used, the virtualization application interconnects logical L2 datapaths with logical L3 datapaths by extending the short-cut tunnel mesh beyond a single LDPS to a union of ports of all the interconnected LDP sets. When everything is executed at the first-hop, the first-hop elements typically have access to all the states of the logical network through which the packet traverses.
The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description and the Drawings is needed. Moreover, the claimed subject matters are not to be limited by the illustrative details in the Summary, Detailed Description and the Drawing, but rather are to be defined by the appended claims, because the claimed subject matters can be embodied in other specific forms without departing from the spirit of the subject matters.
BRIEF DESCRIPTION OF THE DRAWINGS
The novel features of the invention are set forth in the appended claims. However, for purpose of explanation, several embodiments of the invention are set forth in the following figures.
FIG. 1 conceptually illustrates a network architecture of some embodiments.
FIG. 2 conceptually illustrates a processing pipeline of some embodiments for processing network data through logical switches and logical routers.
FIG. 3 conceptually illustrates a network architecture in which a logical router is implemented in a single L3 router.
FIG. 4 conceptually illustrates a network architecture in which a logical router is implemented in a managed switching element.
FIG. 5 conceptually illustrates a network architecture in which a logical router is implemented in a distributed manner such that each of several managed switching elements routes packets at L3.
FIG. 6 conceptually illustrates an example implementation of the logical processing pipeline described above by reference to FIG. 2.
FIG. 7 conceptually illustrates the logical processing pipeline of some embodiments for processing a packet through a logical switch, a logical router, and a logical switch.
FIG. 8 conceptually illustrates an example network architecture of some embodiments which implements a logical router and logical switches.
FIG. 9 conceptually illustrates an example network architecture of some embodiments which implements the logical router and logical switches.
FIG. 10 conceptually illustrates an example network architecture of some embodiments which implements the logical router and logical switches.
FIG. 11 conceptually illustrates an example architecture of a host of some embodiments that includes a managed switching element and an L3 router.
FIG. 12 conceptually illustrates an example implementation of logical switches and logical routers in managed switching elements and L3 routers.
FIGS. 13A-13C conceptually illustrate an example operation of logical switches, a logical router implemented in managed switching elements, and an L3 router described above by reference to FIG. 12.
FIG. 14 conceptually illustrates a process that some embodiments perform to determine to which managed switching element to send a packet.
FIG. 15 conceptually illustrates the host as described above by reference to FIG. 8.
FIG. 16 conceptually illustrates a process that some embodiments use to directly forward a packet from a first L3 router to a second L3 router when the first and the second L3 routers are implemented in the same host.
FIG. 17 conceptually illustrates an example implementation of the logical processing pipeline described above by reference to FIG. 2.
FIG. 18 conceptually illustrates a logical processing pipeline of some embodiments for processing a packet through a logical switch, a logical router, and another logical switch.
FIG. 19 conceptually illustrates an example network architecture of some embodiments which implements a logical router and logical switches.
FIG. 20 conceptually illustrates an example network architecture of some embodiments which implements a logical router and logical switches.
FIG. 21 conceptually illustrates an example network architecture of some embodiments which implements a logical router and logical switches.
FIG. 22 conceptually illustrates an example architecture of a host of some embodiments that includes a managed switching element that implements a logical router based on flow entries.
FIG. 23 conceptually illustrates an example implementation of logical switches and logical routers in managed switching elements.
FIG. 24 conceptually illustrates an example operation of logical switches, a logical router, and managed switching elements described above by reference to FIG. 23.
FIG. 25 conceptually illustrates an example implementation of a logical processing pipeline described above by reference to FIG. 2.
FIG. 26 conceptually illustrates a logical processing pipeline of some embodiments for processing a packet through a logical switch, a logical router, and another logical switch.
FIG. 27 conceptually illustrates an example network architecture of some embodiments which implements a logical router and logical switches.
FIG. 28 conceptually illustrates an example network architecture of some embodiments which implements a logical router and logical switches.
FIG. 29 conceptually illustrates an example of a first-hop switching element that performs all of the L2 and L3 processing on a received packet in order to forward and route the packet.
FIGS. 30A-30B conceptually illustrate an example operation of logical switches, a logical router, and managed switching elements described above by reference to FIG. 29.
FIG. 31 conceptually illustrates an example software architecture of a host on which a managed switching element runs.
FIG. 32 conceptually illustrates a process that some embodiments perform to translate network addresses.
FIG. 33 conceptually illustrates that a first-hop switching element of some embodiments performs the entire logical processing pipeline including the NAT operation.
FIG. 34 conceptually illustrates an example that a managed switching element does not perform a logical processing pipeline when sending a returning packet to a managed switching element.
FIG. 35 conceptually illustrates a process that some embodiments perform to send a packet to a destination machine whose address is NAT'ed.
FIG. 36 illustrates an example of migrating NAT state from a first host to a second host as a VM migrates from the first host to the second host.
FIG. 37 illustrates another example of migrating NAT state from a first host to a second host as a VM migrates from the first host to the second host.
FIG. 38 illustrates an example physical implementation of logical switches and a logical router that performs load balancing.
FIG. 39 illustrates another example physical implementation of logical switches and a logical router that performs load balancing.
FIG. 40 illustrates yet another example physical implementation of logical switches and a logical router that performs load balancing.
FIG. 41 conceptually illustrates a load balancing daemon that balances load among the machines that collectively provide a service (e.g., web service).
FIG. 42 illustrates a DHCP daemon that provides DHCP service to different logical networks for different users.
FIG. 43 illustrates a central DHCP daemon and several local DHCP daemons.
FIG. 44 conceptually illustrates an example of performing some logical processing at the last hop switching element.
FIGS. 45A-45B conceptually illustrate an example operation of logical switches, a logical router, and managed switching elements described above by reference to FIG. 44.
FIG. 46 conceptually illustrates an example of performing some logical processing at the last hop switching element.
FIGS. 47A-47B conceptually illustrate an example operation of logical switches, a logical router, and managed switching elements described above by reference to FIG. 46.
FIG. 48 conceptually illustrates an example software architecture of a host on which a managed switching element runs.
FIG. 49 conceptually illustrates a process that some embodiments perform to resolve network addresses.
FIG. 50 illustrates a map server that allows several hosts (or VMs) that each run an L3 daemon to avoid broadcasting ARP requests.
FIG. 51 illustrates a process that some embodiments perform to maintain a mapping table that includes mappings of IP and MAC addresses.
FIG. 52 illustrates a process that some embodiments perform to maintain a mapping table that includes mappings of IP and MAC addresses.
FIG. 53 conceptually illustrates a controller instance of some embodiments that generates flows by performing table mapping operations on tables using a table mapping processor (not shown) such as nLog.
FIG. 54 illustrates an example architecture and a user interface.
FIG. 55 illustrates tables before a stage described above by reference to FIG. 54.
FIG. 56 illustrates tables after the user supplies a logical port's identifier, an IP address to associate with the port, and a net mask to add the logical port to the logical router.
FIG. 57 illustrates a result of a set of table mapping operations.
FIG. 58 illustrates a result of a set of table mapping operations.
FIG. 59 illustrates tables after the stage described above by reference to FIG. 54.
FIG. 60 illustrates a result of a set of table mapping operations.
FIG. 61 illustrates a result of a set of table mapping operations.
FIG. 62 illustrates new rows added to some of the tables after stages described above by reference to FIG. 61.
FIG. 63 illustrates an architecture after a control application generates logical data by performing table mapping operations as described above by reference to FIGS. 55-62.
FIG. 64 conceptually illustrates an electronic system with which some embodiments of the invention are implemented.
DETAILED DESCRIPTION
Some embodiments of the invention provide a network control system that allows logical datapath (LDP) sets (e.g., logical networks) to be implemented by switching elements of a physical network. To implement LDP sets, the network control system of some embodiments generates physical control plane data from logical forwarding plane data. The physical control plane data is then pushed to the managed switching elements, where it is typically converted into physical forwarding plane data that allows the managed switching elements to perform their forwarding decisions. Based on the physical forwarding data, the managed switching elements can process data packets in accordance with the logical processing rules specified within the physical control plane data.
A single logical datapath set provides switching fabric to interconnect a number of logical ports, which can be either attached to physical or virtual endpoints. In some embodiments, the creation and use of such LDP sets and logical ports provides a logical service model that corresponds to a virtual local area network (VLAN). This model, in some embodiments, limits the operations of the network control system to defining only logical L2 switching capabilities. However, other embodiments extend the operations of the network control system to both the logical L2 switching capabilities and the logical L3 switching capabilities.
The network control system of some embodiments supports the following logical L3 switching capabilities.
- Logical routing. Instead of performing just L2 switching for packets, the network control system of some embodiments also defines the physical control plane data to direct the managed switching elements to forward packets based on Internet Protocol (IP) addresses when crossing L2 broadcast domains (IP subnets). Such logical L3 routing resolves the scalability issues of L2 networks.
- Gateway virtualization. Instead of interfacing with external networks by using a purely L2 interface, the network control system of some embodiments can use an IP interface to interact with external networks. In some embodiments, the network control system defines such an IP interface by defining a single logical gateway even when multiple physical egress and ingress points to and from the external networks exist. Accordingly, some embodiments interface with external IP networks by using gateway virtualization.
- Network Address Translation. An entire L3 subnet may be network address translated (NAT'ed). In some embodiments, the logical network uses private addresses and exposes only NAT'ed IP addresses for external networks. Moreover, in some embodiments, the subnets of the logical network interconnect over NATs or use destination NAT'ing to implement fine-grained application level routing decisions.
- Stateful filtering. Similar to NAT'ing, some embodiments isolate subnets from the external network by using stateful access control lists (ACLs). Also, some embodiments place ACLs between the logical subnets.
- Load-balancing. In some cases, the logical network is used to provide services. For these and other cases, the network control system provides virtual IP addresses for the application clusters. In some embodiments, the network control system specifies load-balancing operations that enable spreading incoming application traffic over a set of logical IP addresses.
- DHCP. While a virtual machine (VM) can be set up to provide dynamic IP address allocation services within the logical network, a service provider may prefer more efficient realization of the dynamic host configuration protocol (DHCP) service at the infrastructure level. Accordingly, the network control system of some embodiments provides an efficient realization of the DHCP service at the infrastructure level.
The design for each of these L3 features will be described below. Implementation-wise the features are largely orthogonal, so one of ordinary skill will realize that these features do not all have to be offered by a network control system of some embodiments. Before describing the features further, several assumptions should be mentioned. These assumptions are as follows.
- Large networks. Logical L3 networks spanning multiple L2 networks will be larger than the logical L2 networks. Some embodiments solve logical L3 problems for server clusters as large as 10K servers by using a map-reduce distributed processing technique.
- Physical traffic non-locality. Logical subnets within a data center may exchange significant traffic within the data center. Some embodiments preserve the traffic locality to the extent that this is possible. In the above-mentioned map-reduce example, the traffic has no locality in terms of endpoints.
- Logical traffic locality. There is indeed locality when it comes to the traffic exchanged between the logical subnets. In other words, not every logical network has clients for the map-reduce cluster mentioned above.
- Placement of the functionalities. As mentioned in U.S. patent application Ser. No. 13/177,535, which is incorporated herein by reference, the managed switching elements, in some embodiments, are (1) edge switching elements of a physical network (i.e., switching elements that have direct connections with the virtual or physical computing devices connected by the physical network), and (2) non-edge switching elements that are inserted in the managed-switching element hierarchy to simplify and/or facilitate the operation of the controlled edge switching elements. As further described in U.S. patent application Ser. No. 13/177,535, the edge switching elements include, in some embodiments, (1) switching elements that have direct connections with the virtual or physical computing devices connected by the network, and (2) integration elements (called extenders) that connect a first managed portion of the network to a second managed portion of the network (e.g., a portion in a different physical location than the first managed portion), or to an unmanaged portion of the network (e.g., to the internal network of an enterprise). Some embodiments perform the logical L3 routing ideally at the first managed edge switching element, i.e., at the first-hop edge switching element, which may be implemented in the hypervisor that also hosts the virtual machines interconnected by the physical network. Ideally, the first-hop switching element performs all or most of the L3 routing because the network control system of some embodiments can then consider the non-edge switching elements (internal network) as nothing but a fabric for interconnecting the devices.
Some of the embodiments described below are implemented in a novel distributed network control system that is formed by one or more controllers (also called controller instances below) for managing one or more shared forwarding elements. The shared forwarding elements in some embodiments can include virtual or physical network switches, software switches (e.g., Open vSwitch), routers, and/or other switching devices, as well as any other network elements (such as load balancers, etc.) that establish connections between these switches, routers, and/or other switching devices. Such forwarding elements (e.g., physical switches or routers) are also referred to below as switching elements. In contrast to an off-the-shelf switch, a software forwarding element is a switch that in some embodiments is formed by storing its switching table(s) and logic in the memory of a standalone device (e.g., a standalone computer), while in other embodiments, it is a switch that is formed by storing its switching table(s) and logic in the memory of a device (e.g., a computer) that also executes a hypervisor and one or more virtual machines on top of that hypervisor.
In some embodiments, the controller instances allow the system to accept logical datapath sets from users and to configure the switching elements to implement these logical datapath sets. In some embodiments, one type of controller instance is a device (e.g., a general-purpose computer) that executes one or more modules that transform the user input from a logical control plane to a logical forwarding plane, and then transform the logical forwarding plane data to physical control plane data. These modules in some embodiments include a control module and a virtualization module. A control module allows a user to specify and populate a logical datapath set, while a virtualization module implements the specified logical datapath set by mapping the logical datapath set onto the physical switching infrastructure. In some embodiments, the control and virtualization applications are two separate applications, while in other embodiments they are part of the same application.
From the logical forwarding plane data for a particular logical datapath set, the virtualization module of some embodiments generates universal physical control plane (UPCP) data that is generic for any managed switching element that implements the logical datapath set. In some embodiments, this virtualization module is part of a controller instance that is a master controller for the particular logical datapath set. This controller is referred to as the logical controller.
In some embodiments, the UPCP data is then converted to customized physical control plane (CPCP) data for each particular managed switching element by a controller instance that is a master physical controller instance for the particular managed switching element, or by a chassis controller for the particular managed switching element, as further described in U.S. patent application Ser. No. 13/589,077, filed Aug. 17, 2012, which is incorporated herein by reference. When the chassis controller generates the CPCP data, the chassis controller obtains the UPCP data from the virtualization module of the logical controller through the physical controller.
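As one way to picture this UPCP-to-CPCP customization, the following Python sketch specializes a universal flow entry into a switch-specific entry by substituting physical port numbers and tunnel endpoint addresses. The data layout and the names used here (customize_flows, port_map, tunnel_ips) are illustrative assumptions, not the actual controller implementation.

```python
# Illustrative sketch: specializing universal physical control plane (UPCP)
# flow entries into customized physical control plane (CPCP) entries for one
# managed switching element. Names and data layout are hypothetical.

UPCP_FLOWS = [
    # A universal entry refers to logical ports and abstract tunnel endpoints.
    {"match": {"logical_inport": "lport-1"},
     "action": {"output": "tunnel-to:lport-7"}},
]

def customize_flows(upcp_flows, port_map, tunnel_ips):
    """Replace abstract identifiers with switch-specific port numbers and IP addresses."""
    cpcp_flows = []
    for flow in upcp_flows:
        match = dict(flow["match"])
        action = dict(flow["action"])
        # Map the logical ingress port to this switch's physical port number.
        if "logical_inport" in match:
            match["in_port"] = port_map[match.pop("logical_inport")]
        # Resolve the abstract tunnel endpoint to a concrete remote IP address.
        output = action.get("output", "")
        if isinstance(output, str) and output.startswith("tunnel-to:"):
            remote_lport = output.split(":", 1)[1]
            action["output"] = {"tunnel_remote_ip": tunnel_ips[remote_lport]}
        cpcp_flows.append({"match": match, "action": action})
    return cpcp_flows

# Per-switch data of the kind a physical or chassis controller would hold.
print(customize_flows(UPCP_FLOWS,
                      port_map={"lport-1": 3},
                      tunnel_ips={"lport-7": "10.0.0.7"}))
```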
Irrespective of whether the physical controller or chassis controller generates the CPCP data, the CPCP data for a particular managed switching element needs to be propagated to the managed switching element. In some embodiments, the CPCP data is propagated through a network information base (NIB) data structure, which in some embodiments is an object-oriented data structure. Several examples of using the NIB data structure are described in U.S. patent application Ser. Nos. 13/177,529 and 13/177,533, which are incorporated herein by reference. As described in these applications, the NIB data structure is also used in some embodiments to serve as a communication medium between different controller instances, and to store data regarding the logical datapath sets (e.g., logical switching elements) and/or the managed switching elements that implement these logical datapath sets.
However, other embodiments do not use the NIB data structure to propagate CPCP data from the physical controllers or chassis controllers to the managed switching elements, to communicate between controller instances, and to store data regarding the logical datapath sets and/or managed switching elements. For instance, in some embodiments, the physical controllers and/or chassis controllers communicate with the managed switching elements through OpenFlow entries and updates over the configuration protocol. Also, in some embodiments, the controller instances use one or more direct communication channels (e.g., RPC calls) to exchange data. In addition, in some embodiments, the controller instances (e.g., the control and virtualization modules of these instances) express the logical and/or physical data in terms of records that are written into the relational database data structure. In some embodiments, this relational database data structure is part of the input and output tables of a table mapping engine (called nLog) that is used to implement one or more modules of the controller instances.
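To illustrate the table mapping idea at a high level, the sketch below joins two hypothetical input tables to produce an output table, in the spirit of a datalog-style engine such as nLog. The table names and columns are invented for illustration and do not reflect the actual nLog rules.

```python
# Illustrative sketch of a table-mapping (datalog-like) evaluation: output
# rows are produced by joining rows of input tables. Tables and columns here
# are hypothetical.

logical_port_table = [
    {"lport": "lp1", "lswitch": "ls1"},
    {"lport": "lp2", "lswitch": "ls1"},
]
port_binding_table = [
    {"lport": "lp1", "chassis": "host-a", "vif": "vif-3"},
    {"lport": "lp2", "chassis": "host-b", "vif": "vif-9"},
]

def map_tables(logical_ports, bindings):
    """Join the two input tables on 'lport' to produce a physical-location table."""
    output = []
    for lp in logical_ports:
        for b in bindings:
            if lp["lport"] == b["lport"]:
                output.append({"lswitch": lp["lswitch"],
                               "lport": lp["lport"],
                               "chassis": b["chassis"],
                               "vif": b["vif"]})
    return output

# Re-running the mapping after an input-table change yields the updated output table.
for row in map_tables(logical_port_table, port_binding_table):
    print(row)
```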
I. Logical Routing
Some embodiments in some cases model logical routing as an act of interconnecting two or more LDP sets operating in L2 domains by a logical router that implements a LDPS operating in an L3 domain. A packet traversing from a logical L2 domain to another will take the following four steps in some embodiments. These four steps are described below in terms of the logical processing operations that the network control system implements. However, it is to be understood that these operations are performed by the managed switching elements of the network based on the physical control plane data that is produced by the network control system.
First, the packet will be processed through an L2 table pipeline of the originating logical L2 domain. The pipeline will conclude with the destination media access control (MAC) address being forwarded to a logical port attached to a logical port of a logical router.
Second, the packet will be processed though a logical router's L3 datapath, again by sending it through this router's L3 table pipeline. The L2 lookup stage common in physical routers is skipped in the router's L3 datapath in some embodiments, as the logical router will only receive packets requiring routing.
In some embodiments, the L3 forwarding decision will use the prefix forwarding information base (FIB) entries that are provisioned by the logical control plane of the logical router. In some embodiments, a control application is used to receive the logical control plane data, and to convert this data to logical forwarding plane data that is then supplied to the network control system. For the L3 forwarding decision, some embodiments use the prefix FIB entries to implement longest prefix matching.
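As a concrete illustration of longest prefix matching over FIB entries, the following Python sketch selects the most specific prefix that contains the destination address; the FIB entries and logical port names are hypothetical.

```python
# Minimal sketch of a longest-prefix-match lookup over FIB entries that map
# IP prefixes to logical output ports of the logical router. The entries
# below are illustrative only.
import ipaddress

FIB = [
    (ipaddress.ip_network("10.1.0.0/16"), "lrp-to-ls1"),
    (ipaddress.ip_network("10.1.2.0/24"), "lrp-to-ls2"),
    (ipaddress.ip_network("0.0.0.0/0"),   "lrp-uplink"),
]

def lookup(dst_ip):
    """Return the logical port for the most specific prefix containing dst_ip."""
    dst = ipaddress.ip_address(dst_ip)
    candidates = [(net, port) for net, port in FIB if dst in net]
    # Longest prefix wins: the largest prefix length is the most specific route.
    net, port = max(candidates, key=lambda entry: entry[0].prefixlen)
    return port

print(lookup("10.1.2.5"))   # -> lrp-to-ls2 (the /24 beats the /16)
print(lookup("10.1.9.9"))   # -> lrp-to-ls1
```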
As a result, the L3 router will forward the packet to the logical port that is “connected” to the destination L2 LDPS. Before forwarding the packet further to that LDPS, the L3 router will change the originating MAC address to one that is defined in its domain as well as resolve the destination IP address to a destination MAC address. The resolution is executed by the last “IP output” stage of the L3 data pipeline in some embodiments. The same pipeline will decrement TTL and update the checksum (and respond with ICMP if TTL goes to zero).
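The "IP output" stage described here can be pictured with the following sketch, which rewrites the source MAC to one defined in the next domain, resolves the destination IP to a MAC through an internal mapping (rather than ARP), and decrements the TTL. The packet field names and the mapping table are illustrative assumptions.

```python
# Illustrative sketch of the final "IP output" stage of the L3 pipeline:
# rewrite the source MAC to the egress port's MAC, resolve the destination IP
# to a destination MAC, and decrement the TTL. Field names are hypothetical;
# checksum update is omitted from this sketch.

def ip_output(packet, egress_port_mac, ip_to_mac):
    if packet["ttl"] <= 1:
        # The caller would generate an ICMP time-exceeded reply here.
        return None
    packet = dict(packet)
    packet["ttl"] -= 1                      # TTL decrement (checksum updated alongside)
    packet["src_mac"] = egress_port_mac     # MAC defined in the next LDPS's domain
    packet["dst_mac"] = ip_to_mac[packet["dst_ip"]]  # internal resolution, no ARP
    return packet

pkt = {"dst_ip": "10.1.2.5", "ttl": 64,
       "src_mac": "02:00:00:00:00:01", "dst_mac": "02:00:00:00:01:01"}
print(ip_output(pkt, "02:00:00:00:01:02", {"10.1.2.5": "02:00:00:00:02:05"}))
```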
It should be noted that some embodiments rewrite the MAC address before feeding the processed packet to the next LDPS, because without this rewriting a different forwarding decision could result at the next LDPS. It should also be noted that even though traditional routers execute the resolution of the destination IP address using ARP, some embodiments do not employ ARP for this purpose in the L3 logical router because as long as the next-hop is a logical L2 datapath, this resolution remains internal to the virtualization application.
Third, the packet will be processed through an L2 table pipeline of the destination logical L2 domain. The destination L2 table pipeline determines the logical egress port along which it should send the packet. In case of an unknown MAC address, this pipeline would resolve the MAC address location by relying on some distributed lookup mechanism. In some embodiments, the managed switching elements rely on a MAC learning algorithm, e.g., they flood the unknown packets. In these or other embodiments, the MAC address location information can also be obtained by other mechanisms, for instance out-of-band. If such a mechanism is available in some embodiments, the last logical L2 table pipeline uses this mechanism to obtain the MAC address location.
Fourth, the packet gets sent to the logical port attached to the physical port representing the logical port attachment. At this stage, if the port is point-to-point media (e.g., virtual network interface, VIF), there's nothing left to do but to send the packet to the port. However, if the last LDPS was an L3 router and hence the attachment is a physical L3 subnet, the attachment point, in some embodiments, resolves the destination IP address by using ARP before sending the packet out. In that case, the source MAC address would be egress specific and not the logical MAC interface address in case of a VIF. In other embodiments, resolving the destination IP address by using ARP is performed during the second step by the L3 logical router.
In the example above, there's only a single logical router interconnecting logical L2 datapaths, but nothing limits the topologies. One of ordinary skill in the art will recognize that more LDP sets can be interconnected for richer topologies.
In some embodiments, the control application allows an L3 specific logical state to be defined in terms of one or more tables that specify a logical L3 pipeline. The corresponding logical control plane managing the LDPS pipeline can either rely on static route configuration, or peer with other LDP sets over a standard routing protocol.
In some embodiments, the virtualization application defines the physical realization of the above-described, four-step L2/L3 packet processing into physical control plane data, which when translated into physical forwarding data by the managed switching elements, effectuates a sequence of logical pipeline executions that are all or predominantly performed at the first-hop, managed edge switching element. In order to maintain the locality of the physical traffic, the first-hop executes the series of pipelines (with all state required) and directly sends the traffic towards the ultimate egress location in the physical network. When short cut tunnels are used, the virtualization application interconnects logical L2 datapaths with logical L3 datapaths by extending the short-cut tunnel mesh beyond a single LDPS to a union of ports of all the interconnected LDP sets.
When everything is executed at the first-hop, the first-hop elements typically have access to all the states of the logical network through which the packet traverses. The dissemination (and its scaling implications) of the state for the execution of the logical pipelines at the first-hop switching element is described further below.
FIG. 1 conceptually illustrates a network architecture 100 of some embodiments. Specifically, this figure illustrates that a logical router 105 routes packets between two LDP sets (e.g., logical networks) 150 and 155. As shown, the network architecture 100 includes the logical router 105, logical switches 110 and 115, and machines 120-145.
The logical switch 110 is a logical switch (or a logical switching element) described in U.S. patent application Ser. No. 13/177,535. The logical switch 110 is implemented across several managed switching elements (not shown). The logical switch 110 routes network traffic between the machines 120-130 at L2 (layer 2). That is, the logical switch 110 makes switching decisions to route network data at the data link layer between the machines 120-130 based on one or more forwarding tables (not shown) that the logical switch 110 has. The logical switch 110, along with several other logical switches (not shown), routes the network traffic for the logical network 150. The logical switch 115 is another logical switch. The logical switch 115 routes the traffic between machines 135-145 for the logical network 155.
A logical router in some embodiments routes traffic at L3 (layer 3—network layer) between different logical networks. Specifically, the logical router routes network traffic between two or more logical switches based on a set of routing tables. In some embodiments, a logical router is implemented in a single managed switching element while in other embodiments a logical router is implemented in several different managed switching elements in a distributed manner. A logical router of these different embodiments will be described in detail further below. The logical router 105 routes the network traffic at L3 between the logical networks 150 and 155. Specifically, the logical router 105 routes the network traffic between the two logical switches 110 and 115.
The machines 120-145 are machines that are capable of exchanging data packets. For instance, each machine 120-145 has a network interface controller (NIC) so that applications that execute on the machine 120-145 can exchange data between them through the logical switches 110 and 115 and the logical router 105.
The logical networks 150 and 155 are different in that the machines in each network use different L3 addresses. For instance, the logical networks 150 and 155 are different IP subnets for two different departments of a company.
In operation, the logical switches 110 and 115 and the logical router 105 function like switches and routers. For instance, the logical switch 110 routes data packets originating from one of the machines 120-130 and heading to another of the machines 120-130. When the logical switch 110 in the logical network 150 receives a data packet that is destined for one of the machines 135-145 in the logical network 155, the logical switch 110 sends the packet to the logical router 105. The logical router 105 then routes the packet, based on the information included in the header of the packet, to the logical switch 115. The logical switch 115 then routes the packet to one of the machines 135-145. Data packets originating from one of the machines 135-145 are routed by the logical switches 110 and 115 and the logical router 105 in a similar manner.
FIG. 1 illustrates a single logical router that routes data between the two logical networks 150 and 155. One of ordinary skill in the art will recognize that there could be more than one logical router involved in routing packets between two logical networks.
FIG. 2 conceptually illustrates a processing pipeline 200 of some embodiments for processing network data through logical switches and logical routers. Specifically, the processing pipeline 200 includes three stages 205-215 for processing a data packet through a logical switch 220, a logical router 225, and then a logical switch 230, respectively. This figure illustrates the logical router 225 and the logical switches 220 and 230 in the top half of the figure and the processing pipeline 200 in the bottom half of the figure.
The logical router 225 is similar to the logical router 105 described above by reference to FIG. 1, in that the logical router 225 routes data packets between the logical switches 220 and 230. The logical switches 220 and 230 are similar to the logical switches 110 and 115. The logical switches 220 and 230 each forward the traffic at L2 for a logical network.
When the logical switch 220 receives a packet, the logical switch 220 performs stage 205 (L2 processing) of the logical processing pipeline 200 in order to forward the packet in one logical network. When the packet is destined for another logical network, the logical switch 220 forwards the packet to the logical router 225. The logical router 225 then performs stage 210 (L3 processing) of the logical processing pipeline 200 on the packet in order to route the data at L3. The logical router 225 sends this packet to another logical router (not shown) or, if the logical router 225 is coupled to the logical switch 230, the logical router 225 sends the packet to the logical switch 230 that would send the packet directly to the destination machine of the packet. The logical switch 230, which directly sends the packet to the packet's destination, performs stage 215 (L2 processing) of the logical processing pipeline 200 in order to forward the packet to the packet's destination.
In some embodiments, logical switches and logical routers are implemented by a set of managed switching elements (not shown). These managed switching elements of some embodiments implement the logical switches and logical routers by performing a logical processing pipeline such as the logical processing pipeline 200. The managed switching elements of some embodiments perform the logical processing pipelines based on flow entries in the managed switching elements. The flow entries (not shown) in the managed switching elements are configured by the network control system of some embodiments. More details of the logical processing pipeline 200 will be described further below.
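For readers who prefer pseudocode, the following Python sketch models the three stages 205-215 as a chain of functions. The stage functions and packet fields are hypothetical stand-ins for the flow-entry lookups that the managed switching elements actually perform.

```python
# Illustrative sketch (not the patented implementation): the three-stage
# pipeline of FIG. 2 modeled as a chain of stage functions.

def l2_processing_source(packet):
    # Stage 205: forward within the source logical network, or hand the
    # packet off to the logical router when it leaves the local subnet.
    packet["next_hop"] = ("logical_router"
                          if packet["dst_subnet"] != packet["src_subnet"]
                          else "local_egress_port")
    return packet

def l3_processing(packet):
    # Stage 210: route at L3 toward the logical switch of the destination subnet.
    packet["next_hop"] = "destination_logical_switch"
    return packet

def l2_processing_destination(packet):
    # Stage 215: forward to the logical egress port attached to the destination machine.
    packet["next_hop"] = "destination_egress_port"
    return packet

def logical_pipeline(packet):
    packet = l2_processing_source(packet)
    if packet["next_hop"] == "logical_router":
        packet = l3_processing(packet)
        packet = l2_processing_destination(packet)
    return packet

print(logical_pipeline({"src_subnet": "10.1.1.0/24", "dst_subnet": "10.1.2.0/24"}))
```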
The next three figures, FIGS. 3, 4, and 5, conceptually illustrate several implementations of logical switches and logical routers of some embodiments. FIGS. 3 and 4 illustrate two different implementations of centralized L3 routing, while FIG. 5 illustrates distributed L3 routing.
FIG. 3 conceptually illustrates a network architecture 300. Specifically, FIG. 3 illustrates that the logical router 225 is implemented in a single L3 router 360 (e.g., a hardware router or a software router). The L3 router 360 routes the packets for different logical networks, each of which includes several logical switches implemented in several different managed switching elements. This figure is horizontally divided into a left half and a right half that represent logical and physical implementations, respectively. This figure is also vertically divided into a bottom half and a top half that represent layer 2 and layer 3, respectively. FIG. 3 illustrates that the network architecture 300 includes the L3 router 360 and managed switching elements 305, 310, 315, and 320. This figure also illustrates that each of the logical switches 220 and 230 is logically coupled to three VMs.
The L3 router 360 implements the logical router 225. The L3 router 360 routes packets between different logical networks that include logical switches 220 and 230. The L3 router 360 routes the packets according to L3 entries 335 that specify the manner in which the packets should be routed at L3. For instance, the L3 entries of some embodiments are entries (e.g., routes) in routing tables that specify that a packet that has a destination IP address that falls in a particular range of IP addresses should be sent out through a particular logical port of the logical router 225. In some embodiments, the logical ports of the logical router 225 are mapped to the ports of the L3 router and the logical router 225 generates the L3 entries based on the mappings. Mapping ports of a logical router to an L3 router that implements the logical router will be described further below.
The managed switching elements 305-320 of some embodiments implement logical switches in a distributed manner. That is, a logical switch in these embodiments may be implemented across one or more of the managed switching elements 305-320. For instance, the logical switch 220 may be implemented across the managed switching elements 305, 310, and 315, and the logical switch 230 may be implemented across the managed switching elements 305, 315, and 320. The six VMs 362-374 logically coupled to the logical switches 220 and 230 are coupled to the managed switching elements 310-320 as shown.
The managed switching elements 305-320 of some embodiments each forward the packets according to L2 flow entries that specify the manner in which the packets should be forwarded at L2. For instance, the L2 flow entries may specify that a packet that has a particular destination MAC address should be sent out through a particular logical port of the logical switch. Each of the managed switching elements 305-320 has a set of L2 flow entries 340 (the flow entries 340 for switching elements 305-315 are not depicted for simplicity). The L2 flow entries for each managed switching element are configured in the managed switching element by the controller cluster. Configuring managed switching elements by configuring L2 flow entries for the managed switching elements will be described in detail further below.
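A minimal sketch of such L2 flow entries, assuming a simple MAC-address-to-logical-port mapping, follows; the addresses and port names are invented for illustration.

```python
# Illustrative sketch of L2 flow entries: each entry maps a destination MAC
# address to a logical egress port of the logical switch. The addresses and
# port names below are hypothetical.

L2_FLOW_ENTRIES = {
    # destination MAC       -> logical egress port
    "02:00:00:00:00:01": "ls1-port-1",
    "02:00:00:00:00:02": "ls1-port-2",
}

def l2_forward(dst_mac):
    """Return the logical egress port, or None to fall back to flooding/unknown handling."""
    return L2_FLOW_ENTRIES.get(dst_mac)

print(l2_forward("02:00:00:00:00:02"))  # -> ls1-port-2
```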
The managed switching element 305 of some embodiments is a second-level managed switching element. A second-level managed switching element is a managed non-edge switching element, which, in contrast to a managed edge switching element, does not send and receive packets directly to and from the machines. A second-level managed switching element facilitates packet exchanges between non-edge managed switching elements and edge managed switching elements. A pool node and an extender, which are described in U.S. patent application Ser. No. 13/177,535, are also second-level managed switching elements. The managed switching element 305 of some embodiments functions as an extender. That is, the managed switching element 305 communicatively bridges remote managed networks (not shown) that are separated by one or more other networks (not shown).
The managed switching element 305 of some embodiments is communicatively coupled to the L3 router 360. When there are packets that need to be routed at L3, the managed switching elements 310-320 send the packets to the managed switching element 305 so that the L3 router 360 routes the packets at L3. More details about a centralized logical router that is implemented in an L3 router will be described further below by reference to FIGS. 6-16.
FIG. 4 conceptually illustrates a network architecture 400. Specifically, FIG. 4 illustrates that the logical router 225 is implemented in a managed switching element 410. In contrast to the network architecture 300 in which the L3 router 360 routes the packets at L3, the managed switching element 410 routes packets at L3 in the network architecture 400. This figure is horizontally divided into a left half and a right half that represent logical and physical implementations, respectively. This figure is also vertically divided into a bottom half and a top half that represent layer 2 and layer 3, respectively.
The network architecture 400 is similar to the network architecture 300 except that the network architecture 400 does not include the L3 router 360. The managed switching element 410 implements the logical router 225. That is, the managed switching element 410 routes packets between different logical networks that include logical switches 220 and 230. The managed switching element 410 of some embodiments routes the packets according to L3 entries 405 that specify the manner in which the packets should be routed at L3. However, in contrast to the L3 entries 335 of some embodiments, the L3 entries 405 are not entries for routing tables. Rather, the L3 entries 405 are flow entries. As described in U.S. patent application Ser. No. 13/177,535, a flow entry includes a qualifier and an action, while the entries in routing tables are just lookup tables for finding the next hops for the packets. Also, the L3 flow entries may specify the manner in which to generate entries in the routing tables (not shown).
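The contrast between a routing-table entry and a flow entry can be sketched as follows; the structures below are illustrative only and are not an OpenFlow or product schema.

```python
# Illustrative contrast between a routing-table entry (a plain lookup result)
# and a flow entry (a qualifier plus an action), in the style of the L3
# entries discussed above. These structures are invented for illustration.

routing_table_entry = {
    "prefix": "10.1.2.0/24",
    "next_hop_port": "lrp-to-ls2",   # a lookup table only yields the next hop
}

l3_flow_entry = {
    "qualifier": {"ip_dst_prefix": "10.1.2.0/24"},   # what the packet must match
    "action": [                                       # what to do when it matches
        ("decrement_ttl", None),
        ("set_src_mac", "02:00:00:00:01:01"),
        ("output", "lrp-to-ls2"),
    ],
}

print(routing_table_entry["next_hop_port"], l3_flow_entry["action"][-1])
```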
In addition to implementing a centralized logical router, the managed switching element 410 of some embodiments implements one or more logical switches that are implemented across several managed switching elements. The managed switching element 410 therefore has its own set of L2 flow entries 340 (not depicted). In the architecture 400, the managed switching elements 410 and 310-320 together implement the logical switches 220 and 230 in a distributed manner.
The managed switching element 410 of some embodiments thus implements both a centralized logical router and logical switches. In other embodiments, implementation of a centralized logical router and logical switches may be separated into two or more managed switching elements. For instance, one managed switching element (not shown) may implement a centralized logical router using flow entries and another managed switching element (not shown) may implement logical switches based on flow entries in a distributed manner. More details about a centralized logical router that is implemented in a managed switching element based on flow entries will be described further below by reference to FIGS. 17-24.
FIG. 5 conceptually illustrates a network architecture 500. Specifically, FIG. 5 illustrates that the logical router 225 is implemented in a distributed manner such that each of several managed switching elements routes packets at L3. FIG. 5 illustrates that the network architecture 500 includes four managed switching elements 505-520.
The managed switching elements 505-520 implement a logical router and several logical switches for several different logical networks. Each of the managed switching elements 505-520 of some embodiments is an edge switching element. That is, the managed switching element has one or more machines that are coupled to the managed switching element. The machines that are coupled to the managed switching elements are also logically coupled to the logical switches. The machines that are coupled to a managed switching element may or may not be logically coupled to the same logical switch.
Each of the managed switching elements 505-520 implements at least one logical router and at least one logical switch that will route and forward packets to and from the machines coupled to the managed switching element. In other words, when the managed switching element receives a packet from the machines coupled to the managed switching element, the managed switching element makes both logical forwarding decisions and logical routing decisions. Each of the managed switching elements 505-520 makes the logical forwarding and routing decisions according to the L2 entries and L3 entries in the logical flow entries 550. The logical flow entries 550 include a set of L2 flow entries 530 and a set of L3 flow entries 535. More details about a distributed logical router will be described further below by reference to FIGS. 25-30B.
FIGS. 6-16 illustrate a centralized logical router implemented in a router. FIG. 6 conceptually illustrates an example implementation of the logical processing pipeline 200 described above by reference to FIG. 2. FIG. 6 illustrates a network architecture 600. In the network architecture 600, the logical processing pipeline 200 is performed by three managed switching elements 615, 620, and 625 and an L3 router 635. In particular, the L2 processing 205 and the L2 processing 215 are performed in a distributed manner across managed switching elements 615, 620, and 625. The L3 processing 210 is performed by the L3 router 635. FIG. 6 also illustrates source machine 610 and destination machine 630.
The managed switching element 615 is an edge switching element that directly receives the packets from a machine coupled to the edge switching element. The managed switching element 615 receives packets from the source machine 610. When the managed switching element 615 receives a packet from the source machine 610, the managed switching element 615 performs a portion of the L2 processing 205 on the packet in order to logically forward the packet.
There may be one or more managed switching elements (not shown) between the managed switching element 615 and the managed switching element 620. These managed switching elements have network constructs (e.g., PIFs, VIFs, etc.) to which the logical constructs (e.g., logical ports) of the logical switch 220 (not shown in FIG. 6) are mapped.
When the packet is headed to the destination machine 630, which is in another logical network, the packet is forwarded to the managed switching element 620. The managed switching element 620 then performs the rest of the L2 processing 205 and sends the packet to an L3 router 635, which implements a centralized logical router (not shown).
Similar to the L3 router 360 described above by reference to FIG. 3, the L3 router 635 is a hardware router or a software router of which the ports are mapped to the ports of a logical router. The L3 router 635 performs the L3 processing 210 on the packet in order to logically route the packet. That is, the L3 router 635 sends the packet to another logical router (not shown) or to the managed switching element 620.
The managed switching element 620 is a second-level managed switching element that functions as an extender in some embodiments. The managed switching element 620 receives a packet from the L3 router 635 and starts performing the L2 processing 215 of the logical processing pipeline 200. There may be one or more managed switching elements (not shown) between the managed switching element 620 and the managed switching element 625. These managed switching elements have network constructs to which the logical constructs of the logical switch 230 (not shown in FIG. 6) are mapped.
The managed switching element 625 in the example receives the packet from the managed switching element 620. The managed switching element 625 performs the rest of the L2 processing 215 on the packet in order to logically forward the packet. In this example, the managed switching element 625 is also the switching element that directly sends the packet to the destination machine 630. However, there may be one or more managed switching elements (not shown) between the managed switching element 625 and the destination machine 630. These managed switching elements have network constructs to which the logical constructs of the logical switch 230 (not shown in FIG. 6) are mapped.
Although the L2 processing 205 and the L2 processing 215 are performed in a distributed manner in this example, the L2 processing 205 and the L2 processing 215 do not have to be performed in a distributed manner. For instance, the managed switching element 615 may perform the entire L2 processing 205 and the managed switching element 625 may perform the entire L2 processing 215. In such case, the managed switching element 620 would just relay the packets between the L3 router and the managed switching elements 615 and 625.
FIG. 7 conceptually illustrates the logical processing pipeline 200 of some embodiments for processing a packet through the logical switch 220, the logical router 225, and the logical switch 230. Specifically, this figure illustrates the logical processing pipeline 200 when performed in the network architecture 600 described above by reference to FIG. 6. As described above, in the network architecture 600, the L2 processing 205, the L3 processing 210, and the L2 processing 215 are performed by the managed switching elements 615, 620, and 625 and the L3 router 635.
The L2 processing 205, in some embodiments, includes eight stages 705-740 for processing a packet through the logical switch 220 (not shown in FIG. 7) in a logical network (not shown) that is implemented across the managed switching elements 615 and 620. In some embodiments, the managed switching element 615 that receives the packet performs a portion of the L2 processing 205 when the managed switching element 615 receives the packet. The managed switching element 620 then performs the rest of the L2 processing 205.
In some embodiments, a packet includes a header and a payload. The header includes, in some embodiments, a set of fields that contains information used for routing the packet through a network. Logical switches and logical routers may make switching/routing decisions based on the information contained in the header fields and may, in some cases, modify some or all of the header fields.
In the stage 705 of the L2 processing 205, ingress context mapping is performed on the packet to determine the logical context of the packet. In some embodiments, the stage 705 is performed when the logical switch 220 receives the packet (e.g., the packet is initially received by the managed switching element 615). A logical context, in some embodiments, represents the state of the packet with respect to the logical switch. The logical context may, for example, specify the logical switch to which the packet belongs, the logical port of the logical switch through which the packet was received, the logical port of the logical switch through which the packet is to be transmitted, the stage of the logical forwarding plane of the logical switch at which the packet is, etc.
Some embodiments determine the logical context of a packet based on the source MAC address of the packet (i.e., the machine from which the packet was sent). Some embodiments perform the logical context lookup based on the source MAC address of the packet and the inport (i.e., ingress port) of the packet (i.e., the port of the managed switching element 615 through which the packet was received). Other embodiments may use other fields in the packet's header (e.g., MPLS header, VLAN id, etc.) for determining the logical context of the packet.
After the first stage 705 is performed, some embodiments store the information that represents the logical context in one or more fields of the packet's header. These fields may also be referred to as a logical context tag or a logical context ID. Furthermore, the logical context tag may coincide with one or more known header fields (e.g., the VLAN id field) in some embodiments. As such, these embodiments do not utilize the known header field or its accompanying features in the manner that the header field is defined to be used. Alternatively, some embodiments store the information that represents the logical context as metadata that is associated with the packet (instead of stored in the packet itself) and passed along with the packet.
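For illustration only, the following Python sketch shows the two approaches described above for carrying a logical context with a packet: writing it into a repurposed header field (here the VLAN id field) or attaching it as out-of-band metadata. The Packet class, the bit layout, and all values are hypothetical and are not the encoding of any particular embodiment.

from dataclasses import dataclass, field

@dataclass
class Packet:
    src_mac: str
    dst_mac: str
    vlan_id: int | None = None                    # header field that may be repurposed
    metadata: dict = field(default_factory=dict)  # out-of-band logical context

def tag_in_header(packet: Packet, logical_switch: int, ingress_port: int) -> None:
    # Encode the logical datapath and ingress port into the VLAN id field.
    packet.vlan_id = (logical_switch << 8) | ingress_port

def tag_as_metadata(packet: Packet, logical_switch: int, ingress_port: int) -> None:
    # Keep the packet untouched and pass the context alongside it instead.
    packet.metadata["logical_switch"] = logical_switch
    packet.metadata["logical_ingress_port"] = ingress_port

if __name__ == "__main__":
    p = Packet(src_mac="00:00:00:00:00:01", dst_mac="01:01:01:01:01:01")
    tag_in_header(p, logical_switch=220, ingress_port=1)
    print(p.vlan_id)  # 56321 == (220 << 8) | 1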
In some embodiments, the second stage 710 is defined for the logical switch 220. In some such embodiments, the stage 710 operates on the packet's logical context to determine ingress access control of the packet with respect to the logical switch. For example, an ingress ACL is applied to the packet to control the packet's access to the logical switch when the logical switch receives the packet. Based on the ingress ACL defined for the logical switch, the packet may be further processed (e.g., by the stage 715) or the packet may be dropped, for example.
In the third stage 715 of the L2 processing 205, L2 forwarding is performed on the packet in the context of the logical switch. In some embodiments, the third stage 715 operates on the packet's logical context to process and forward the packet with respect to the logical switch 220. For instance, some embodiments define an L2 forwarding table or L2 forwarding entries for processing the packet at layer 2.
Moreover, when the packet's destination is in another logical network (i.e., when the packet's destination logical network is different than the logical network whose traffic is processed by the logical switch 220), the logical switch 220 sends the packet to the logical router 225, which will then perform the L3 processing 210 in order to route the packet to the destination logical network. Thus, at the third stage 715, the managed switching element 615 of some embodiments determines that the packet should be forwarded to the logical router 225 through a logical port (not shown) of the logical switch that is associated with the logical router 225. In other embodiments, the managed switching element 615 does not necessarily determine whether the packet should be forwarded to the logical router 225. Rather, the packet would have an address of a port of the logical router 225 as a destination address and the managed switching element 615 forwards this packet through the logical port of the logical switch according to the destination address.
At the fourth stage 720, egress context mapping is performed to identify a physical result that corresponds to the result of the logical forwarding of the packet. For example, the logical processing of the packet may specify that the packet is to be sent out of one or more logical ports (e.g., a logical egress port) of the logical switch 220. As such, the egress context mapping operation identifies a physical port(s) of one or more of the managed switching elements (including the managed switching elements 615 and 620) that corresponds to the particular logical port of the logical switch 220. The managed switching element 615 determines that the physical port (e.g., a VIF) to which the logical port determined at the previous stage 715 is mapped is a port (not shown) of the managed switching element 620.
The fifth stage 725 of the L2 processing 205 performs a physical mapping based on the egress context mapping performed at the fourth stage 720. In some embodiments, the physical mapping determines operations for sending the packet towards the physical port that was determined in the fourth stage 720. For example, the physical mapping of some embodiments determines one or more queues (not shown) associated with one or more ports of the set of ports (not shown) of the managed switching element 615 that is performing the L2 processing 205 through which to send the packet in order for the packet to reach the physical port(s) determined in the fourth stage 720. This way, the managed switching elements can forward the packet along the correct path in the network for the packet to reach the determined physical port(s).
As shown, the sixth stage 730 of the L2 processing 205 is performed by the managed switching element 620. The sixth stage 730 is similar to the first stage 705. The stage 730 is performed when the managed switching element 620 receives the packet. At the stage 730, the managed switching element 620 looks up the logical context of the packet and determines that L2 egress access control is left to be performed.
The seventh stage 735 of some embodiments is defined for the logical switch 220. The seventh stage 735 of some such embodiments operates on the packet's logical context to determine egress access control of the packet with respect to the logical switch. For instance, an egress ACL may be applied to the packet to control the packet's access out of the logical switch 220 after logical forwarding has been performed on the packet. Based on the egress ACL defined for the logical switch, the packet may be further processed (e.g., sent out of a logical port of the logical switch or sent to a dispatch port for further processing) or the packet may be dropped, for example.
The eighth stage 740 is similar to the fifth stage 725. At the eighth stage 740, the managed switching element 620 determines a specific physical port (not shown) of the managed switching element 620 to which the logical egress port of the logical switch 220 is mapped.
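As a minimal sketch, the following Python example chains the eight stages 705-740 as stub functions, with the first five run by the first-hop managed switching element and the remaining three by the second-level managed switching element. The function names, the example MAC address, and the context dictionary are hypothetical; a real implementation would drive each stage from flow entries installed by the network controller.

def ingress_context_mapping(pkt, ctx): ctx["lswitch"] = 220; return True
def ingress_acl(pkt, ctx): return pkt["src_mac"] != "de:ad:be:ef:00:00"  # drop a spoofed source
def l2_forwarding(pkt, ctx): ctx["logical_egress_port"] = "X"; return True
def egress_context_mapping(pkt, ctx): ctx["phys_port"] = ("mse-620", 1); return True
def physical_mapping(pkt, ctx): ctx["out_queue"] = 0; return True
def context_lookup(pkt, ctx): return "logical_egress_port" in ctx
def egress_acl(pkt, ctx): return True
def egress_physical_mapping(pkt, ctx): ctx["out_port"] = ctx["phys_port"][1]; return True

FIRST_HOP_STAGES = [ingress_context_mapping, ingress_acl, l2_forwarding,
                    egress_context_mapping, physical_mapping]
SECOND_LEVEL_STAGES = [context_lookup, egress_acl, egress_physical_mapping]

def run(stages, pkt, ctx):
    # Each stage either advances the packet or drops it by returning False.
    return all(stage(pkt, ctx) for stage in stages)

if __name__ == "__main__":
    pkt, ctx = {"src_mac": "00:00:00:00:00:01"}, {}
    if run(FIRST_HOP_STAGES, pkt, ctx) and run(SECOND_LEVEL_STAGES, pkt, ctx):
        print("forward out port", ctx["out_port"])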
The L3 processing 210 includes six stages 745-761 for processing a packet through the logical router 225 (not shown in FIG. 7) that is implemented by the L3 router 635. As mentioned above, L3 processing involves performing a set of logical routing lookups for determining where to route the packet through a layer 3 network.
The first stage 745 performs a logical ingress ACL lookup for determining access control when the logical router 225 receives the packet (i.e., when the L3 router 635 which implements the logical router 225 receives the packet). The next stage 746 performs network address translation (NAT) on the packet. In particular, the stage 746 performs destination NAT (DNAT) to revert the destination address of the packet back to the real address of the destination machine that is hidden from the source machine of the packet. This stage 746 is performed when DNAT is enabled.
The next stage 750 performs a logical L3 routing for determining one or more logical ports to send the packet through the layer 3 network based on the L3 addresses (e.g., destination IP address) of the packet and routing tables (e.g., containing L3 entries). Since the logical router 225 is implemented by the L3 router 635, the routing tables are configured in the L3 router 635.
At the fourth stage 755, the L3 router 635 of some embodiments also performs source NAT (SNAT) on the packet. For instance, the L3 router 635 replaces the source IP address of the packet with a different IP address in order to hide the source IP address when the source NAT is enabled.
The fifth stage 760 performs logical L3 egress ACL lookups for determining access control before the logical router 225 routes the packet out of the logical router 225 through the port determined in the stage 750. The L3 egress ACL lookups are performed based on the L3 addresses (e.g., source and destination IP addresses) of the packet.
The sixth stage 761 performs address resolution in order to translate the destination L3 address (e.g., a destination IP address) into a destination L2 address (e.g., a destination MAC address). In some embodiments, the L3 router 635 uses a standard address resolution (e.g., by sending out ARP requests or looking up an ARP cache) to find the destination L2 address that corresponds to the destination IP address.
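A minimal Python sketch of the six stages 745-761 follows. The ACL flags, NAT table, routes, and ARP cache are hypothetical in-memory stand-ins for state that an L3 router would maintain, and the addresses are illustrative only.

import ipaddress

INGRESS_ACL = {"allow_all": True}
DNAT_TABLE = {"10.0.0.100": "1.1.2.10"}     # hypothetical public address -> real address
SNAT_SOURCE = "10.0.0.1"                    # hypothetical address used to hide sources
ROUTES = {"1.1.1.0/24": "port1", "1.1.2.0/24": "port2"}
EGRESS_ACL = {"allow_all": True}
ARP_CACHE = {"1.1.2.10": "00:00:00:00:00:04"}   # assumed mapping for the destination

def l3_process(pkt, dnat_enabled=True, snat_enabled=False):
    if not INGRESS_ACL["allow_all"]:                       # stage 745: ingress ACL
        return None
    if dnat_enabled and pkt["dst_ip"] in DNAT_TABLE:       # stage 746: DNAT
        pkt["dst_ip"] = DNAT_TABLE[pkt["dst_ip"]]
    dst = ipaddress.ip_address(pkt["dst_ip"])              # stage 750: L3 routing
    out_port = next(port for prefix, port in ROUTES.items()
                    if dst in ipaddress.ip_network(prefix))
    if snat_enabled:                                       # stage 755: SNAT
        pkt["src_ip"] = SNAT_SOURCE
    if not EGRESS_ACL["allow_all"]:                        # stage 760: egress ACL
        return None
    pkt["dst_mac"] = ARP_CACHE[pkt["dst_ip"]]              # stage 761: address resolution
    return out_port, pkt

if __name__ == "__main__":
    print(l3_process({"src_ip": "1.1.1.10", "dst_ip": "1.1.2.10"}))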
When the logical router 225 is not coupled to the destination logical network, the logical router 225 sends the packet to another logical router (not shown) towards the destination logical network. When the logical router 225 is coupled to the destination logical network, the logical router 225 routes the packet to the destination logical network (i.e., to the logical switch that forwards the packet for the destination logical network).
The L2 processing 215, in some embodiments, includes eight stages 765-798 for processing the packet through the logical switch 230 in another logical network (not shown in FIG. 7) that is implemented across the managed switching elements 620 and 625. In some embodiments, the managed switching element 625 in the managed network that receives the packet performs the L2 processing 215 when the managed switching element 625 receives the packet from the managed switching element 620. The stages 765-798 are similar to the stages 705-740, respectively, except that the stages 765-798 are performed by the logical switch 230 (i.e., by the managed switching elements 620 and 625 that implement the logical switch 230). That is, the stages 765-798 are performed to forward the packet received from the L3 router 635 to the destination through the managed switching elements 620 and 625.
FIG. 8 conceptually illustrates an example network architecture 800 of some embodiments which implements the logical router 225 and logical switches 220 and 230. Specifically, the network architecture 800 represents a physical network that effectuates logical networks whose data packets are switched and/or routed by the logical router 225 and the logical switches 220 and 230. The figure illustrates in the top half of the figure the logical router 225 and the logical switches 220 and 230. This figure illustrates in the bottom half of the figure an L3 router 860. Also illustrated in the bottom half are a second-level managed switching element 810 and managed switching elements 815 and 820, which run in hosts 890, 880, and 885 (e.g., machines operated by operating systems such as Windows™ and Linux™), respectively. The figure illustrates VMs 1-4 in both the top and the bottom of the figure.
In this example, the logical switch 220 forwards data packets between the logical router 225, VM 1, and VM 2. The logical switch 230 forwards data packets between the logical router 225, VM 3, and VM 4. As mentioned above, the logical router 225 routes data packets between the logical switches 220 and 230 and possibly other logical routers and switches (not shown). The logical switches 220 and 230 and the logical router 225 are logically coupled through logical ports (not shown) and exchange packets through the logical ports. These logical ports are mapped to physical ports of the L3 router 860 and the managed switching elements 810, 815 and 820.
In some embodiments, each of the logical switches 220 and 230 is implemented across the managed switching elements 815 and 820 and possibly other managed switching elements (not shown). In some embodiments, the logical router 225 is implemented in the L3 router 860 which is communicatively coupled to the managed switching element 810.
In this example, the managed switching elements 810, 815 and 820 are software switching elements running in hosts 890, 880 and 885, respectively. The managed switching elements 810, 815 and 820 have flow entries which implement the logical switches 220 and 230. Using these flow entries, the managed switching elements 815 and 820 route network data (e.g., packets) between network elements in the network that are coupled to the managed switching elements 810, 815 and 820. For instance, the managed switching element 815 routes network data between VMs 1 and 3, and the second-level managed switching element 810. Similarly, the managed switching element 820 routes network data between VMs 2 and 4, and the second-level managed switching element 810. As shown, the managed switching elements 815 and 820 each have three ports (depicted as numbered squares) through which to exchange data packets with the network elements that are coupled to the managed switching elements 815 and 820.
The managed switching element 810 is similar to the managed switching element 305 described above by reference to FIG. 3 in that the managed switching element 810 is a second-level managed switching element that functions as an extender. The managed switching element 810 runs in the same host as the L3 router 860, which in this example is a software router.
In some embodiments, tunnels are established by the network control system (not shown) to facilitate communication between the network elements. For instance, the managed switching element 810 is coupled to the managed switching element 815, which runs in the host 880, through a tunnel that terminates at port 2 of the managed switching element 815 as shown. Similarly, the managed switching element 810 is coupled to the managed switching element 820 through a tunnel that terminates at port 1 of the managed switching element 820.
Different types of tunneling protocols are supported in different embodiments. Examples of tunneling protocols include Control And Provisioning of Wireless Access Points (CAPWAP), Generic Routing Encapsulation (GRE), and GRE over Internet Protocol Security (IPsec), among other types of tunneling protocols.
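The following Python sketch illustrates the general idea of tunneling between two managed switching elements: the sender wraps the logical packet in an outer header addressed to the remote switching element, and the receiver strips that header before running its pipeline. It is a simplified stand-in, not a faithful encoding of GRE, CAPWAP, or any other tunnel header format, and all field names and addresses are hypothetical.

def encapsulate(inner_packet: bytes, tunnel_src: str, tunnel_dst: str, key: int) -> dict:
    # Wrap the logical packet in an outer header addressed to the remote
    # switching element; the key can identify the logical network.
    return {"outer_src": tunnel_src, "outer_dst": tunnel_dst,
            "tunnel_key": key, "payload": inner_packet}

def decapsulate(outer_packet: dict) -> tuple[int, bytes]:
    # The receiving switching element strips the outer header and recovers
    # the logical packet plus the tunnel key before running its pipeline.
    return outer_packet["tunnel_key"], outer_packet["payload"]

if __name__ == "__main__":
    frame = b"\x00" * 64
    wrapped = encapsulate(frame, "10.0.0.2", "10.0.0.3", key=220)
    key, inner = decapsulate(wrapped)
    assert inner == frame and key == 220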
In this example, each of the hosts 880 and 885 includes a managed switching element and several VMs as shown. VMs 1-4 are virtual machines that are each assigned a set of network addresses (e.g., a MAC address for L2, an IP address for L3, etc.) and can send and receive network data to and from other network elements. The VMs are managed by hypervisors (not shown) running on the hosts 880 and 885.
Several example data exchanges through the network architecture 800 will now be described. When VM 1 that is coupled to the logical switch 220 sends a packet to VM 2 that is also coupled to the same logical switch 220, the packet is first sent to the managed switching element 815. The managed switching element 815 then performs the L2 processing 205 on the packet because the managed switching element 815 is the edge switching element that receives the packet from VM 1. The result of the L2 processing 205 on this packet would indicate that the packet should be sent to the managed switching element 820 to get to VM 2 through port 4 of the managed switching element 820. Because VMs 1 and 2 are in the same logical network, L3 routing is not necessary and no L3 processing needs to be performed on this packet. The packet is then sent to the managed switching element 820 via the second-level managed switching element 810, which bridges the managed switching element 815 and the managed switching element 820. The packet reaches VM 2 through port 4 of the managed switching element 820.
When VM 1 that is coupled to the logical switch 220 sends a packet to VM 3 that is coupled to the logical switch 230, the packet is first sent to the managed switching element 815. The managed switching element 815 performs a portion of L2 processing on the packet. However, because the packet is sent from one logical network to another (i.e., the logical L3 destination address of the packet is for another logical network), L3 processing needs to be performed on this packet.
The managed switching element 815 sends the packet to the second-level managed switching element 810 so that the managed switching element 810 performs the rest of the L2 processing on the packet to forward the packet to the L3 router 860. The result of the L3 processing performed at the L3 router 860 would indicate that the packet should be sent back to the managed switching element 810. The managed switching element 810 then performs a portion of another L2 processing and forwards the packet received from the L3 router 860 back to the managed switching element 815. The managed switching element 815 performs the L2 processing 215 on the packet received from the managed switching element 810, and the result of this L2 processing would indicate that the packet should be sent to VM 3 through port 5 of the managed switching element 815.
When VM 1 that is coupled to the logical switch 220 sends a packet to VM 4 that is coupled to the logical switch 230, the packet is first sent to the managed switching element 815. The managed switching element 815 performs the L2 processing 205 on the packet. However, because the packet is sent from one logical network to another, L3 processing needs to be performed.
The managed switching element 815 sends the packet to the L3 router 860 via the managed switching element 810 so that the L3 router 860 performs the L3 processing 210 on the packet. The result of the L3 processing 210 performed at the L3 router 860 would indicate that the packet should be sent to the managed switching element 820. The managed switching element 810 then performs a portion of L2 processing on the packet received from the L3 router 860, and the result of this L2 processing would indicate that the packet should be sent to VM 4 through the managed switching element 820. The managed switching element 820 performs the rest of the L2 processing to determine that the packet should be sent to VM 4 through port 5 of the managed switching element 820.
FIG. 9 conceptually illustrates an example network architecture 900 of some embodiments which implements the logical router 225 and logical switches 220 and 230. Specifically, the network architecture 900 represents a physical network that effectuates logical networks whose data packets are switched and/or routed by the logical router 225 and the logical switches 220 and 230. The figure illustrates in the top half of the figure the logical router 225 and the logical switches 220 and 230. This figure illustrates in the bottom half of the figure the L3 router 860. Also illustrated in the bottom half are a second-level managed switching element 905, the second-level managed switching element 810, and managed switching elements 815 and 820, which run in hosts 910, 890, 880, and 885, respectively. The figure illustrates VMs 1-4 in both the top and the bottom of the figure.
The network architecture 900 is similar to the network architecture 800 except that the network architecture 900 additionally includes the managed switching element 905 which runs in the host 910. The managed switching element 905 of some embodiments is a second-level managed switching element that functions as a pool node.
In some embodiments, tunnels are established by the network control system (not shown) to facilitate communication between the network elements. For instance, the managed switching element 815 in this example is coupled to the managed switching element 905, which runs in the host 910, through a tunnel that terminates at port 1 of the managed switching element 815 as shown. Similarly, the managed switching element 820 is coupled to the managed switching element 905 through a tunnel that terminates at port 2 of the managed switching element 820. Also, the managed switching elements 905 and 810 are coupled through a tunnel as shown.
The logical router 225 and the logical switches 220 and 230 are implemented in the L3 router 860 and the managed switching elements 810, 815, and 820 as described by reference to FIG. 8 above, except that the second-level managed switching element 905 is involved in the data packet exchange. That is, the managed switching elements 815 and 810 exchange packets through the managed switching element 905.
FIG. 10 conceptually illustrates an example network architecture 1000 of some embodiments which implements the logical router 225 and logical switches 220 and 230. The network architecture 1000 is similar to the network architecture 800 except that there is a tunnel established between the managed switching element 810 and the managed switching element 820. This figure illustrates that the network architecture 1000 of some embodiments is a mixture of the network architecture 800 and the network architecture 900. That is, some managed edge switching elements have tunnels to a second-level managed switching element that is coupled to a centralized L3 router, while other managed edge switching elements have to go through a second-level managed switching element that functions as a pool node in order to exchange packets with a second-level managed switching element that is coupled to the centralized L3 router.
FIG. 11 conceptually illustrates an example architecture of the host 890 of some embodiments that includes the managed switching element 810 and the L3 router 860 (not shown). Specifically, this figure illustrates that the L3 router 860 is configured in a namespace 1120 of the host 890. The host 890, in some embodiments, is a machine that is managed by an operating system (e.g., Linux) that is capable of creating namespaces and virtual machines. As shown, the host 890 in this example includes the managed switching element 810, the namespace 1120, and a NIC 1145. This figure also illustrates a controller cluster 1105.
The controller cluster 1105 is a set of network controllers or controller instances that manage the network elements, including the managed switching element 810. The managed switching element 810 in this example is a software switching element implemented in the host 890 that includes a user space 1112 and a kernel 1110. The managed switching element 810 includes a control daemon 1115 running in the user space 1112, and a controller patch 1130 and a bridge 1135 running in the kernel 1110. The user space 1112 and the kernel 1110, in some embodiments, are of an operating system for the host 890, while in other embodiments the user space 1112 and the kernel 1110 are of a virtual machine that is running on the host 890.
In some embodiments, the controller cluster 1105 communicates with a control daemon 1115 (e.g., by using OpenFlow protocol or another communication protocol), which, in some embodiments, is an application running in the background of the user space 1112. The control daemon 1115 communicates with the controller cluster 1105 in order to process and route packets that the managed switching element 810 receives. Specifically, the control daemon 1115, in some embodiments, receives configuration information from the controller cluster 1105 and configures the controller patch 1130. For example, the control daemon 1115 receives commands from the controller cluster 1105 regarding operations for processing and routing packets that the managed switching element 810 receives.
The control daemon 1115 also receives configuration information for the controller patch 1130 to set up ports (not shown) connecting to the logical router (not shown) implemented in the namespace 1120 such that the logical router populates the routing tables and other tables with appropriate entries.
The controller patch 1130 is a module that runs in the kernel 1110. In some embodiments, the control daemon 1115 configures the controller patch 1130. When configured, the controller patch 1130 contains rules (e.g., flow entries) regarding processing and forwarding the packets that the controller patch 1130 receives. The controller patch 1130 of some embodiments also creates a set of ports (e.g., VIFs) to exchange packets with the namespace 1120.
The controller patch 1130 receives packets from a network stack 1150 of the kernel 1110 or from the bridge 1135. The controller patch 1130 determines to which namespace to send the packets based on the rules regarding processing and routing the packets. The controller patch 1130 also receives packets from the namespace 1120 and sends the packets to the network stack 1150 or the bridge 1135 based on the rules. More details about the architecture of a managed switching element are described in U.S. patent application Ser. No. 13/177,535.
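The following Python sketch illustrates, at a high level, the division of labor just described: a control daemon accepts flow entries pushed by a controller cluster and installs them into the kernel-side patch, which then classifies packets against those entries. The class names and the flow-entry representation are hypothetical and do not reflect the OpenFlow protocol or the Open vSwitch API.

class ControllerPatch:
    def __init__(self):
        self.flows = []   # (match_dict, action) pairs installed by the daemon

    def install(self, match: dict, action: str) -> None:
        self.flows.append((match, action))

    def classify(self, packet: dict) -> str:
        # First matching flow wins; unmatched packets fall through to a default action.
        for match, action in self.flows:
            if all(packet.get(k) == v for k, v in match.items()):
                return action
        return "drop"

class ControlDaemon:
    def __init__(self, patch: ControllerPatch):
        self.patch = patch

    def on_flow_from_controller(self, match: dict, action: str) -> None:
        # Configuration received from the controller cluster is pushed into the patch.
        self.patch.install(match, action)

if __name__ == "__main__":
    patch = ControllerPatch()
    daemon = ControlDaemon(patch)
    daemon.on_flow_from_controller({"dst_mac": "01:01:01:01:01:01"}, "send_to_namespace")
    print(patch.classify({"dst_mac": "01:01:01:01:01:01"}))  # send_to_namespace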
The namespace 1120 (e.g., Linux namespace) is a container created in the host 890. The namespace 1120 can implement network stacks, network devices, network addresses, routing tables, network address translation tables, network caches, etc. (not all of these are shown in FIG. 11). The namespace 1120 thus can implement a logical router when the namespace is configured to handle packets with logical source or destination addresses. The namespace 1120 can be configured to handle such packets, for example, by configuring the routing tables 1155 of the namespace. In some embodiments, the namespace 1120 populates the routing tables 1155 as the namespace 1120 connects to the managed switching element 810 and exchanges packets (i.e., dynamic routing). In other embodiments, the controller cluster 1105 may directly configure the routing tables 1155 by populating the routing tables 1155 with routes.
Moreover, the namespace, in some embodiments, also performs network address translation (NAT) on the packets that the namespace routes. For instance, the namespace may change the source network address of a received packet into another network address (i.e., perform source NAT).
The bridge 1135 routes network data between the network stack 1150 and network hosts external to the host (i.e., network data received through the NIC 1145). As shown, the bridge 1135 routes network data between the network stack 1150 and the NIC 1145 and between the controller patch 1130 and the NIC 1145. The bridge 1135 of some embodiments performs standard L2 packet learning and routing.
The network stack 1150 can receive packets from network hosts external to the managed switching element 810 through the NIC 1145. The network stack 1150 then sends the packets to the controller patch 1130. In some cases, the packets are received from network hosts external to the managed switching element through tunnels. In some embodiments, the tunnels terminate at the network stack 1150. Thus, when the network stack 1150 receives a packet through a tunnel, the network stack 1150 unwraps the tunnel header (i.e., decapsulates the payload) and sends the unwrapped packet to the controller patch 1130.
An example operation of the managed switching element 810 and the namespace 1120 will now be described. In this example, tunnels are established between the managed switching element 810 and the managed switching elements 815 and 820 (not shown in FIG. 11) that are external to the host 890. That is, the managed switching elements 810, 815, and 820 are connected through the tunnels as illustrated in FIG. 8. The tunnels terminate at the network stack 1150.
The managed switching element 815 sends a packet, sent by VM 1 to VM 4, to the managed switching element 810. The packet is received by the NIC 1145 and then is sent to the bridge 1135. Based on the information in the packet header, the bridge 1135 determines that the packet is sent over the established tunnel and sends the packet to the network stack 1150. The network stack 1150 unwraps the tunnel header and sends the unwrapped packet to the controller patch 1130.
According to the rules that the controller patch 1130 has, the controller patch 1130 sends the packet to the namespace 1120 because the packet is sent from one logical network to another logical network. For instance, the rules may say that a packet with a certain destination MAC address should be sent to the namespace 1120. In some cases, the controller patch 1130 removes the logical context from the packet before sending the packet to the namespace. The namespace 1120 then performs an L3 processing on the packet to route the packet between the two logical networks.
By performing the L3 processing, the namespace 1120 determines that the packet should be sent to the controller patch 1130 because the destination network layer address should go to a logical switch that belongs to the destination logical network. The controller patch 1130 receives the packet and sends the packet through the network stack 1150, the bridge 1135, and the NIC 1145 over the tunnel to the managed switching element 820 that implements the logical switch that belongs to the destination logical network.
As described above, some embodiments implement the L3 router 860 in the namespace 1120. Other embodiments, however, may implement the L3 router 860 in a VM that runs on the host 890.
FIG. 12 conceptually illustrates an example implementation of logical switches and logical routers in managed switching elements and L3 routers. Specifically, this figure illustrates implementation of the logical router 225 and the logical switches 220 and 230 in the host 890, which includes the second-level managed switching element 810 and the L3 router 860, and the managed switching elements 815 and 820. The figure illustrates in the left half of the figure the logical router 225 and the logical switches 220 and 230. This figure illustrates in the right half of the figure the second-level managed switching element 810 and managed switching elements 815 and 820. The figure illustrates VMs 1-4 in both the right and the left halves of the figure. For simplicity, this figure does not illustrate all the components of the managed switching element, e.g., the network stack 1150.
The logical switches 220 and 230 and the logical router 225 are logically coupled through logical ports. As shown, a logical port X of the logical switch 220 is coupled to the logical port 1 of the logical router 225. Similarly, a logical port Y of the logical switch 230 is coupled to the logical port 2 of the logical router 225. The logical switches 220 and 230 exchange data packets with the logical router 225 through these logical ports. Also, in this example, the logical switch 220 associates the logical port X with a MAC address 01:01:01:01:01:01 which is a MAC address of the logical port 1 of the logical router 225. When the logical switch 220 receives a packet that needs an L3 processing, the logical switch 220 sends the packet out to the logical router 225 through port X. Similarly, the logical switch 230 associates the logical port Y with a MAC address 01:01:01:01:01:02 which is a MAC address of the logical port 2 of the logical router 225. When the logical switch 230 receives a packet that needs an L3 processing, the logical switch 230 sends the packet out to the logical router 225 through port Y.
In this example, the controller cluster 1105 (not shown in FIG. 12) configures the managed switching element 810 such that port 1 of the managed switching element 810 is associated with the same MAC address, 01:01:01:01:01:01, that is associated with port X of the logical switch 220. Accordingly, when the managed switching element 810 receives a packet that has this MAC address as destination MAC address, the managed switching element 810 sends the packet out to the L3 router 860 (configured in the namespace 1120) through the port 1 of the managed switching element 810. As such, port X of the logical switch 220 is mapped to port 1 of the managed switching element 810.
Similarly, port 2 of the managed switching element 810 is associated with the same MAC address, 01:01:01:01:01:02, that is associated with port Y of the logical switch 230. Accordingly, when the managed switching element 810 receives a packet that has this MAC address as destination MAC address, the managed switching element 810 sends the packet out to the L3 router 860 through the port 2 of the managed switching element 810. As such, port Y of the logical switch 230 is mapped to port 2 of the managed switching element 810.
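A minimal Python sketch of this MAC-based mapping follows; it simply maps the MAC addresses of the logical router ports (taken from the example above) to the physical ports of the second-level managed switching element that lead to the L3 router. The dictionary and function names are hypothetical.

PORT_BY_DST_MAC = {
    "01:01:01:01:01:01": 1,   # port X of logical switch 220 -> port 1 of the switching element
    "01:01:01:01:01:02": 2,   # port Y of logical switch 230 -> port 2 of the switching element
}

def output_port_for(dst_mac: str) -> int | None:
    # Returns the physical port leading to the L3 router, or None if the
    # packet is not addressed to a logical router port.
    return PORT_BY_DST_MAC.get(dst_mac)

print(output_port_for("01:01:01:01:01:02"))  # 2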
In this example, the logical router 225 has logical ports 1 and 2 and other logical ports (not shown). Port 1 of the logical router 225 is associated with an IP address 1.1.1.1/24, which represents a subnet behind port 1. That is, when the logical router 225 receives a packet to route and the packet has a destination IP address, e.g., 1.1.1.10, the logical router 225 sends this packet towards the destination logical network (e.g., a logical subnet) through port 1.
Similarly, port 2 of the logical router 225 in this example is associated with an IP address 1.1.2.1/24, which represents a subnet behind port 2. The logical router 225 sends a packet with a destination IP address, e.g., 1.1.2.10, to the destination logical network through port 2.
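The routing decision for these two ports can be sketched in Python as a longest-prefix match over the subnets attached to the logical router's ports; the data structure and function below are illustrative only.

import ipaddress

ROUTER_PORTS = {1: ipaddress.ip_network("1.1.1.0/24"),
                2: ipaddress.ip_network("1.1.2.0/24")}

def route(dst_ip: str) -> int | None:
    # Pick the port whose subnet contains the destination; prefer the
    # longest prefix when prefixes overlap.
    dst = ipaddress.ip_address(dst_ip)
    candidates = [(net.prefixlen, port) for port, net in ROUTER_PORTS.items() if dst in net]
    return max(candidates)[1] if candidates else None

print(route("1.1.2.10"))  # 2
print(route("1.1.1.10"))  # 1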
In this example, the L3 router 860 implements the logical router 225 by populating the L3 router 860's routing tables (not shown) with routes. In some embodiments, the L3 router 860 populates its routing tables when the managed switching element 810 establishes a connection with the L3 router 860 and sends a packet. For instance, when the L3 router receives an initial packet from the managed switching element, the L3 router 860 finds out that packets that have the initial packet's source address as their destination addresses should be sent to the managed switching element 810. The L3 router may also perform a standard address resolution (e.g., by sending out ARP requests) to find out where to send the initial packet. The L3 router 860 will store these “routes” in the routing tables and look up these tables when making routing decisions for the packets that the L3 router receives subsequently. Other L3 routers (not shown) may populate their routing tables in a similar manner.
In other embodiments, the controller cluster configures the routing table of the L3 router 860 such that port 1 of the L3 router 860 is associated with the same IP address that is associated with port 1 of the logical router 225. Similarly, port 2 of the L3 router 860 is associated with the same IP address that is associated with port 2 of the logical router 225. In a similar manner, another logical router (not shown) may be implemented in another L3 router (not shown) of the managed switching element. In some of these embodiments, the controller cluster may employ one or more routing protocols to configure the L3 router.
FIGS. 13A-13C conceptually illustrate an example operation of the logical switches 220 and 230 and the logical router 225, implemented in the managed switching elements 810, 815, and 820 and the L3 router 860 described above by reference to FIG. 12. Specifically, FIGS. 13A-13C illustrate how a packet sent from VM 1 to VM 4 reaches VM 4.
When VM 1 that is coupled to the logical switch 220 sends a packet 1330 to VM 4 that is coupled to the logical switch 230, the packet is first sent to the managed switching element 815 through port 4 of the managed switching element 815. The managed switching element 815 performs an L2 processing on the packet.
As shown in the top half of FIG. 13A, the managed switching element 815 includes a forwarding table that includes rules (e.g., flow entries) for processing and forwarding the packet 1330. When the managed switching element 815 receives the packet 1330 from VM 1 through port 4 of the managed switching element 815, the managed switching element 815 begins processing the packet 1330 based on the forwarding tables of the managed switching element 815. In this example, the packet 1330 has a destination IP address of 1.1.2.10, which is the IP address of VM 4. The packet 1330's source IP address is 1.1.1.10. The packet 1330 also has VM 1's MAC address as a source MAC address and the MAC address of the logical port 1 (i.e., 01:01:01:01:01:01) of the logical router 225 as a destination MAC address.
The managed switching element 815 identifies a record indicated by an encircled 1 (referred to as “record 1”) in the forwarding tables that implements the context mapping of the stage 1340. The record 1 identifies the packet 1330's logical context based on the inport, which is the port 4 through which the packet 1330 is received from VM 1. In addition, the record 1 specifies that the managed switching element 815 store the logical context of the packet 1330 in a set of fields (e.g., a VLAN id field) of the packet 1330's header in some embodiments. In other embodiments, the managed switching element 815 stores the logical context (i.e., the logical switch to which the packet belongs as well as the logical ingress port of that logical switch) in a register, or meta field, of the switch, rather than in the packet. The record 1 also specifies the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port). A dispatch port is described in U.S. patent application Ser. No. 13/177,535.
Based on the logical context and/or other fields stored in the packet 1330's header, the managed switching element 815 identifies a record indicated by an encircled 2 (referred to as “record 2”) in the forwarding tables that implements the ingress ACL of the stage 1342. In this example, the record 2 allows the packet 1330 to be further processed (i.e., the packet 1330 can get through the ingress port of the logical switch 220) and, thus, specifies the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port). In addition, the record 2 specifies that the managed switching element 815 store the logical context (i.e., the packet 1330 has been processed by the second stage 1342 of the processing pipeline 1300) of the packet 1330 in the set of fields of the packet 1330's header.
Next, the managed switching element 815 identifies, based on the logical context and/or other fields stored in the packet 1330's header, a record indicated by an encircled 3 (referred to as “record 3”) in the forwarding tables that implements the logical L2 forwarding of the stage 1344. The record 3 specifies that a packet with the MAC address of the logical port 1 of the logical router 225 as a destination MAC address is to be sent to the logical port X of the logical switch 220.
The record 3 also specifies that the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port). Also, the record 3 specifies that the managed switching element 815 store the logical context (i.e., the packet 1330 has been processed by the third stage 1344 of the processing pipeline 1300) in the set of fields of the packet 1330's header.
Based on the logical context and/or other fields stored in the packet 1330's header, the managed switching element 815 identifies a record indicated by an encircled 4 (referred to as “record 4”) in the forwarding tables that implements the context mapping of the stage 1346. In this example, the record 4 identifies port 1 of the managed switching element 810, to which port 1 of the L3 router 860 is coupled, as the port that corresponds to the logical port X of the logical switch 220 to which the packet 1330 is to be forwarded. The record 4 additionally specifies that the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port).
Based on the logical context and/or other fields stored in the packet 1330's header, the managed switching element 815 then identifies a record indicated by an encircled 5 (referred to as “record 5”) in the forwarding tables that implements the physical mapping of the stage 1348. The record 5 specifies that the packet 1330 is to be sent through port 1 of the managed switching element 815 in order for the packet 1330 to reach the managed switching element 810. In this case, the managed switching element 815 is to send the packet 1330 out of the port 1 of managed switching element 815 that is coupled to the managed switching element 810.
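As an illustration of how a single switching element can walk a packet through several such records, the following Python sketch repeatedly resubmits the packet (the role played by the dispatch port) until a record returns a terminal action such as an output port. The record functions and actions are hypothetical stand-ins for flow entries.

def context_mapping(pkt):      pkt["ctx"] = {"stage": "mapped"};    return "resubmit"
def ingress_acl(pkt):          pkt["ctx"]["stage"] = "acl_ok";      return "resubmit"
def l2_forward(pkt):           pkt["ctx"]["logical_port"] = "X";    return "resubmit"
def egress_ctx_mapping(pkt):   pkt["ctx"]["phys"] = ("mse-810", 1); return "resubmit"
def physical_mapping(pkt):     return ("output", 1)                 # send out port 1

RECORDS = [context_mapping, ingress_acl, l2_forward, egress_ctx_mapping, physical_mapping]

def process(pkt):
    for record in RECORDS:
        action = record(pkt)
        if action != "resubmit":      # a terminal action ends the pipeline
            return action
    return ("drop",)

print(process({"dst_mac": "01:01:01:01:01:01"}))  # ('output', 1)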
As shown in the bottom half of FIG. 13A, the managed switching element 810 includes a forwarding table that includes rules (e.g., flow entries) for processing and routing the packet 1330. When the managed switching element 810 receives the packet 1330 from the managed switching element 815, the managed switching element 810 begins processing the packet 1330 based on the forwarding tables of the managed switching element 810. The managed switching element 810 identifies a record indicated by an encircled 1 (referred to as “record 1”) in the forwarding tables that implements the context mapping of the stage 1350. The record 1 identifies the packet 1330's logical context based on the logical context that is stored in the packet 1330's header. The logical context specifies that the packet 1330 has been processed by the second and third stages 1342 and 1344, which were performed by the managed switching element 815. As such, the record 1 specifies that the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port).
Next, the managed switching element 810 identifies, based on the logical context and/or other fields stored in the packet 1330's header, a record indicated by an encircled 2 (referred to as “record 2”) in the forwarding tables that implements the egress ACL of the stage 1352. In this example, the record 2 allows the packet 1330 to be further processed (e.g., the packet 1330 can get out of the logical switch 220 through port “X” of the logical switch 220) and, thus, specifies the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port). In addition, the record 2 specifies that the managed switching element 810 store the logical context (i.e., the packet 1330 has been processed by the stage 1352 of the processing pipeline 1300) of the packet 1330 in the set of fields of the packet 1330's header.
Next, the managed switching element 810 identifies, based on the logical context and/or other fields stored in the packet 1330's header, a record indicated by an encircled 3 (referred to as “record 3”) in the forwarding tables that implements the physical mapping of the stage 1354. The record 3 specifies the port of the managed switching element 810 through which the packet 1330 is to be sent in order for the packet 1330 to reach the L3 router 860. In this case, the managed switching element 810 is to send the packet 1330 out of port 1 of managed switching element 810 that is coupled to the port 1 of the L3 router 860. In some embodiments, the managed switching element 810 removes the logical context from the packet 1330 before sending the packet to the L3 router 860.
As shown in the top half of FIG. 13B, the L3 router 860 includes an ingress ACL table, a routing table, and an egress ACL table that include entries for processing and routing the packet 1330. When the L3 router 860 receives the packet 1330 from the managed switching element 810, the L3 router 860 begins processing the packet 1330 based on these tables of the L3 router 860. The L3 router 860 identifies an entry indicated by an encircled 1 (referred to as “entry 1”) in the ingress ACL table that implements L3 ingress ACL by specifying that the L3 router 860 should accept the packet based on the information in the header of the packet 1330. The L3 router 860 then identifies an entry indicated by an encircled 2 (referred to as “entry 2”) in the routing table that implements the L3 routing by specifying that the packet 1330 with its destination IP address (i.e., 1.1.2.10) should be sent to the logical switch 230 through port 2 of the logical router 225. The L3 router 860 then identifies an entry indicated by an encircled 3 (referred to as “entry 3”) in the egress ACL table that implements L3 egress ACL by specifying that the L3 router 860 can send the packet out through port 2 of the logical router 225 based on the information in the header of the packet 1330. Also, the L3 router 860 rewrites the source MAC address for the packet 1330 to the MAC address of port 2 of the L3 router 860 (i.e., 01:01:01:01:01:02).
The L3 router 860 then performs an address resolution to translate the destination IP address into the destination MAC address. In this example, the L3 router 860 looks up an ARP cache to find the destination MAC address to which the destination IP address is mapped. The L3 router 860 may send out ARP requests if the ARP cache does not have a corresponding MAC address for the destination IP address. The destination IP address would be resolved to the MAC address of VM 4. The L3 router 860 then rewrites the destination MAC of the packet 1330 using the MAC address to which the destination IP address is resolved. The L3 router 860 would send the packet 1330 to the logical switch 230 through the logical port 2 of the L3 router 860 based on the new destination MAC address.
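The rewrites performed after the routing decision can be sketched in Python as follows; the ARP cache contents and VM 4's MAC address are hypothetical, and a real router would issue an ARP request on a cache miss rather than raise an error.

PORT_MACS = {1: "01:01:01:01:01:01", 2: "01:01:01:01:01:02"}
ARP_CACHE = {"1.1.2.10": "00:00:00:00:00:04"}   # assumed MAC for VM 4

def rewrite_for_egress(pkt: dict, out_port: int) -> dict:
    # The source MAC becomes the MAC of the chosen router port.
    pkt["src_mac"] = PORT_MACS[out_port]
    # The destination MAC is resolved from the destination IP address.
    dst_mac = ARP_CACHE.get(pkt["dst_ip"])
    if dst_mac is None:
        raise LookupError("ARP cache miss: an ARP request would be sent here")
    pkt["dst_mac"] = dst_mac
    return pkt

print(rewrite_for_egress({"dst_ip": "1.1.2.10"}, out_port=2))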
As shown in the bottom half of FIG. 13B, the managed switching element 810 includes a forwarding table that includes rules (e.g., flow entries) for processing and forwarding the packet 1330. When the managed switching element 810 receives the packet 1330 from the L3 router 860 through port 2 of the managed switching element 810, the managed switching element 810 begins processing the packet 1330 based on the forwarding tables of the managed switching element 810. The managed switching element 810 identifies a record indicated by an encircled 4 (referred to as “record 4”) in the forwarding tables that implements the context mapping of the stage 1362. The record 4 identifies the packet 1330's logical context based on the inport, which is the port 2 through which the packet 1330 is received from the L3 router 860. In addition, the record 4 specifies that the managed switching element 810 store the logical context of the packet 1330 in a set of fields (e.g., a VLAN id field) of the packet 1330's header. The record 4 also specifies the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port).
Based on the logical context and/or other fields stored in the packet 1330's header, the managed switching element 810 identifies a record indicated by an encircled 5 (referred to as “record 5”) in the forwarding tables that implements the ingress ACL of the stage 1364. In this example, the record 5 allows the packet 1330 to be further processed and, thus, specifies the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port). In addition, the record 5 specifies that the managed switching element 810 store the logical context (i.e., the packet 1330 has been processed by the stage 1364 of the processing pipeline 1300) of the packet 1330 in the set of fields of the packet 1330's header.
Next, the managed switching element 810 identifies, based on the logical context and/or other fields stored in the packet 1330's header, a record indicated by an encircled 6 (referred to as “record 6”) in the forwarding tables that implements the logical L2 forwarding of the stage 1366. The record 6 specifies that a packet with the MAC address of VM 4 as the destination MAC address should be forwarded through the logical port (not shown) of the logical switch 230.
The record 6 also specifies that the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port). Also, the record 6 specifies that the managed switching element 810 store the logical context (i.e., the packet 1330 has been processed by the stage 1366 of the processing pipeline 1300) in the set of fields of the packet 1330's header.
Based on the logical context and/or other fields stored in the packet 1330's header, the managed switching element 810 identifies a record indicated by an encircled 7 (referred to as “record 7”) in the forwarding tables that implements the context mapping of the stage 1368. In this example, the record 7 identifies port 5 of the managed switching element 820 to which VM 4 is coupled as the port that corresponds to the logical port (determined at stage 1366) of the logical switch 230 to which the packet 1330 is to be forwarded. The record 7 additionally specifies that the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port).
Based on the logical context and/or other fields stored in the packet 1330's header, the managed switching element 810 then identifies a record indicated by an encircled 8 (referred to as “record 8”) in the forwarding tables that implements the physical mapping of the stage 1370. The record 8 specifies a port (not shown) of the managed switching element 810 through which the packet 1330 is to be sent in order for the packet 1330 to reach the managed switching element 820. In this case, the managed switching element 810 is to send the packet 1330 out of the port of managed switching element 810 that is coupled to the managed switching element 820.
As shown in FIG. 13C, the managed switching element 820 includes a forwarding table that includes rules (e.g., flow entries) for processing and routing the packet 1330. When the managed switching element 820 receives the packet 1330 from the managed switching element 810, the managed switching element 820 begins processing the packet 1330 based on the forwarding tables of the managed switching element 820. The managed switching element 820 identifies a record indicated by an encircled 4 (referred to as “record 4”) in the forwarding tables that implements the context mapping of the stage 1372. The record 4 identifies the packet 1330's logical context based on the logical context that is stored in the packet 1330's header. The logical context specifies that the packet 1330 has been processed by the stages 1364 and 1366, which were performed by the managed switching element 810. As such, the record 4 specifies that the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port).
Next, the managed switching element 820 identifies, based on the logical context and/or other fields stored in the packet 1330's header, a record indicated by an encircled 5 (referred to as “record 5”) in the forwarding tables that implements the egress ACL of the stage 1374. In this example, the record 5 allows the packet 1330 to be further processed and, thus, specifies the packet 1330 be further processed by the forwarding tables (e.g., by sending the packet 1330 to a dispatch port). In addition, the record 5 specifies that the managed switching element 820 store the logical context (i.e., the packet 1330 has been processed by the stage 1374 of the processing pipeline 1300) of the packet 1330 in the set of fields of the packet 1330's header.
Next, the managed switching element 820 identifies, based on the logical context and/or other fields stored in the packet 1330's header, a record indicated by an encircled 6 (referred to as “record 6”) in the forwarding tables that implements the physical mapping of the stage 1376. The record 6 specifies the port 5 of the managed switching element 820 through which the packet 1330 is to be sent in order for the packet 1330 to reach VM 4. In this case, the managed switching element 820 is to send the packet 1330 out of port 5 of managed switching element 820 that is coupled to VM 4. In some embodiments, the managed switching element 820 removes the logical context from the packet 1330 before sending the packet to VM 4.
FIG. 14 conceptually illustrates a process 1400 that some embodiments perform to determine to which managed switching element to send a packet. The process 1400, in some embodiments, is performed by a managed edge switching element that receives a packet and forwards that packet to another managed switching element or to a destination machine for the packet.
The process 1400 begins by receiving (at 1405) a packet from a source machine. The process 1400 then performs (at 1410) a portion of L2 processing. As the process performs the L2 processing, the process 1400 determines (at 1415) whether the packet needs to be sent to a second-level managed switching element for further processing of the packet. In some embodiments, the process makes this determination based on the destination L2 address of the packet. The process looks at the destination L2 address and sends out the packet through a port that is associated with the destination L2 address. For instance, when the packet's destination L2 address is an L2 address of an L3 router, the process sends the packet out of a port that is associated with the managed switching element that is associated with an L3 router. When the packet's destination L2 address is an L2 address of the destination machine, the process sends the packet to the managed switching element that is directly connected to the destination machine or to the managed switching element that is closer in the route to the destination machine.
When the process 1400 determines (at 1415) that the packet needs to be sent to a second-level managed switching element, the process 1400 sends (at 1420) the packet to a second-level managed switching element that is communicatively coupled to an L3 router that implements the logical router. Otherwise, the process 1400 sends (at 1425) the packet to the destination machine or to another managed switching element. The process then ends.
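A minimal Python sketch of the determination made at 1415 follows, assuming the edge switching element knows the MAC addresses of the L3 router's ports; those addresses are taken from the earlier example and the function name is hypothetical.

L3_ROUTER_PORT_MACS = {"01:01:01:01:01:01", "01:01:01:01:01:02"}

def next_hop(dst_mac: str) -> str:
    # Packets addressed to an L3 router port are relayed through the
    # second-level managed switching element; everything else goes toward
    # the destination machine.
    if dst_mac in L3_ROUTER_PORT_MACS:
        return "second-level managed switching element"
    return "destination machine or next managed switching element"

print(next_hop("01:01:01:01:01:01"))
print(next_hop("00:00:00:00:00:02"))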
FIG. 15 conceptually illustrates the host 890 described above. Specifically, when the managed switching element 810 receives a packet from an L3 router and the packet is headed to another L3 router implemented in the same host 890, the managed switching element 810 directly bridges the two L3 routers based on the flow entries.
As shown, the managed switching element 810 is coupled to two L3 routers 1 and 2. The flow entries that the managed switching element 810 contains are shown on the right side of the figure. The flow entries indicate that the traffic that is addressed to go from one L3 router to another L3 router should directly go to the other L3 router.
Also, this figure illustrates that an additional router can be provisioned in the host 890 in order to provide additional routing resources when more managed switching elements are provisioned that rely on the existing L3 router to route additional network traffic.
FIG. 16 conceptually illustrates a process 1600 that some embodiments use to directly forward a packet from a first L3 router to a second L3 router when the first and the second L3 routers are implemented in the same host. The process 1600, in some embodiments, is performed by a managed switching element, such as the managed switching element 810 described above, which exchanges packets with two or more L3 routers implemented in a single host.
The process 1600 begins by receiving (at 1605) a packet from a first L3 router. The process 1600 then determines (at 1610) whether the packet is addressed to a second L3 router that is implemented in the same host in which the first L3 router is implemented. The process 1600 determines this by examining the information in the header of the packet (e.g., destination MAC address).
When the process 1600 determines (at 1610) that the packet is headed to the second L3 router, the process 1600 sends the packet to the second L3 router. Otherwise, the process 1600 sends the packet toward the destination of the packet (e.g., another managed switching element or a destination machine). The process 1600 then ends.
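A minimal Python sketch of this determination follows; the MAC addresses of the two co-located L3 routers are hypothetical and stand in for whatever addresses the flow entries of the managed switching element actually match on.

LOCAL_ROUTER_PORTS = {"02:00:00:00:00:01": "L3 router 1",
                      "02:00:00:00:00:02": "L3 router 2"}

def forward_from_router(pkt: dict) -> str:
    # If the destination MAC belongs to another L3 router in the same host,
    # bridge the packet directly; otherwise send it toward its destination.
    dst = LOCAL_ROUTER_PORTS.get(pkt["dst_mac"])
    if dst is not None:
        return f"bridge directly to {dst}"
    return "send toward destination (another switching element or machine)"

print(forward_from_router({"dst_mac": "02:00:00:00:00:02"}))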
FIGS. 17-24 illustrate a centralized logical router implemented in a managed switching element based on flow entries of the managed switching element. FIG. 17 conceptually illustrates an example implementation of the logical processing pipeline 200 described above by reference to FIG. 2. FIG. 17 illustrates a network architecture 1700. In the network architecture 1700, the logical processing pipeline 200 is performed by three managed switching elements 1715, 1720, and 1725. In particular, the L2 processing 205 and the L2 processing 215 are performed in a distributed manner across managed switching elements 1715, 1720, and 1725. The L3 processing 210 is performed by the managed switching element 1720 based on flow entries of the managed switching element 1720. FIG. 17 also illustrates source machine 1710 and destination machine 1730.
The managed switching element 1715 is similar to the managed switching element 615 described above by reference to FIG. 6 in that the managed switching element 1715 is also an edge switching element that directly receives the packets from a machine coupled to the edge switching element. The managed switching element 1715 receives packets from the source machine 1710. When the managed switching element 1715 receives a packet from the source machine 1710, the managed switching element 1715 performs a portion of the L2 processing 205 on the packet in order to logically forward the packet. When the packet is headed to the destination machine 1730, which is in another logical network, the packet is forwarded to the managed switching element 1720.
There may be one or more managed switching elements (not shown) between the managed switching element 1715 and the managed switching element 1720. These managed switching elements have network constructs (e.g., PIFs, VIFs, etc.) to which the logical constructs (e.g., logical ports) of the logical switch 220 (not shown in FIG. 17) are mapped.
The managed switching element 1720 is a second-level managed switching element that functions as an extender in some embodiments. The managed switching element 1720 performs the rest of the L2 processing 205 and also performs the L3 processing 210. The managed switching element 1720 also performs a portion of the L2 processing 215 of the logical processing pipeline 200. The managed switching element 1720 then sends the packet to the managed switching element 1725.
There may be one or more managed switching elements (not shown) between the managed switching element 1720 and the managed switching element 1725. These managed switching elements have network constructs to which the logical constructs of the logical switch 220 (not shown in FIG. 17) are mapped.
The managed switching element 1725 in the example receives the packet from the managed switching element 1720. The managed switching element 1725 performs the rest of the L2 processing 215 on the packet in order to logically forward the packet. In this example, the managed switching element 1725 is also the switching element that directly sends the packet to the destination machine 1730. However, there may be one or more managed switching elements (not shown) between the managed switching element 1725 and the destination machine 1730. These managed switching elements have network constructs to which the logical constructs of the logical switch 230 (not shown in FIG. 17) are mapped.
Although theL2 processing205 and theL2 processing215 are performed in a distributed manner in this example, theL2 processing205 and theL2 processing215 do not have to be performed in a distributed manner. For instance, the managed switchingelement1715 may perform theentire L2 processing205 and the managed switchingelement1725 may perform theentire L2 processing215. In such case, the managed switchingelement1720 would perform only the L3 processing210 of thelogical processing pipeline200.
FIG. 18 conceptually illustrates thelogical processing pipeline200 of some embodiments for processing a packet through thelogical switch220, thelogical router225, and thelogical switch230. Specifically, this figure illustrates thelogical processing pipeline200 when performed in thenetwork architecture1700 described above by reference toFIG. 17. As described above, in thenetwork architecture1700, theL2 processing205, theL3 processing210, and theL2 processing215 are performed by the managed switchingelements1715,1720, and1725.
TheL2 processing205, in some embodiments, includes seven stages1805-1835 for processing a packet through the logical switch220 (not shown inFIG. 18) in a logical network (not shown) that is implemented across the managed switchingelements1715 and1720. In some embodiments, the managed switchingelement1715 that receives the packet performs a portion of theL2 processing205 when the managed switchingelement1715 receives the packet. The managedswitching element1720 then performs the rest of theL2 processing205.
The first five stages1805-1825 are similar to the first five stages705-725 described above by reference toFIG. 7. In thestage1805 of theL2 processing205, ingress context mapping is performed on the packet to determine the logical context of the packet. In some embodiments, thestage1805 is performed when thelogical switch220 receives the packet (e.g., the packet is initially received by the managed switching element1715). After thefirst stage1805 is performed, some embodiments store the information that represents the logical context in one or more fields of the packet's header.
In some embodiments, thesecond stage1810 is defined for thelogical switch220. In some such embodiments, thestage1810 operates on the packet's logical context to determine ingress access control of the packet with respect to the logical switch. For example, an ingress ACL is applied to the packet to control the packet's access to the logical switch when the logical switch receives the packet. Based on the ingress ACL defined for the logical switch, the packet may be further processed (e.g., by the stage1815) or the packet may be dropped, for example.
In thethird stage1815 of theL2 processing205, an L2 forwarding is performed on the packet in the context of the logical switch. In some embodiments, thethird stage1815 operates on the packet's logical context to process and forward the packet with respect to thelogical switch220. For instance, some embodiments define an L2 forwarding table or L2 forwarding entries for processing the packet atlayer 2. Moreover, when the packet's destination is in another logical network (i.e., when the packet's destination logical network is different than the logical network whose traffic is processed by the logical switch220), thelogical switch220 sends the packet to thelogical router225, which will then perform theL3 processing210 in order to route the packet to the destination logical network. Thus, at thethird stage1815, the managed switchingelement1715 determines that the packet should be forwarded to thelogical router225 through a logical port (not shown) of the logical switch that is associated with thelogical router225.
At thefourth stage1820, egress context mapping is performed to identify a physical result that corresponds to the result of the logical forwarding of the packet. For example, the logical processing of the packet may specify that the packet is to be sent out of one or more logical ports (e.g., a logical egress port) of thelogical switch220. As such, the egress context mapping operation identifies a physical port(s) of one or more of the managed switching elements (including the managed switchingelements1715 and1720) that corresponds to the particular logical port of thelogical switch220. The managedswitching element1715 determines that the physical port (e.g. a VIF) to which the logical port determined at theprevious stage1815 is mapped is a port (not shown) of the managed switchingelement1720.
Thefifth stage1825 of theL2 processing205 performs a physical mapping based on the egress context mapping performed at thefourth stage1820. In some embodiments, the physical mapping determines operations for sending the packet towards the physical port that was determined in thefourth stage1820. For example, the physical mapping of some embodiments determines one or more queues (not shown) associated with one or more ports of the set of ports (not shown) of the managed switchingelement1715 that is performing theL2 processing205 through which to send the packet in order for the packet to reach the physical port(s) determined in thefourth stage1820. This way, the managed switching elements can forward the packet along the correct path in the network for the packet to reach the determined physical port(s).
As shown, thesixth stage1830 of theL2 processing205 is performed by the managed switchingelement1720. Thesixth stage1830 is similar to thefirst stage1805. Thestage1830 is performed when the managed switchingelement1720 receives the packet. At thestage1830, the managed switchingelement1720 looks up the logical context of the packet and determines that L2 egress access control is left to be performed.
The seventh stage 1835 of some embodiments is defined for the logical switch 220. The seventh stage 1835 of some such embodiments operates on the packet's logical context to determine egress access control of the packet with respect to the logical switch 220. For instance, an egress ACL may be applied to the packet to control the packet's access out of the logical switch 220 after logical forwarding has been performed on the packet. Based on the egress ACL defined for the logical switch, the packet may be further processed (e.g., sent out of a logical port of the logical switch or sent to a dispatch port for further processing) or the packet may be dropped, for example.
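As a rough sketch (not the patent's implementation), the split of the seven L2 stages between the edge switching element 1715 and the extender 1720 can be summarized as ordered stage lists; everything below is a hypothetical illustration.

```python
# The edge switching element 1715 performs stages 1805-1825; the extender 1720
# performs stages 1830-1835 before handing the packet to the L3 processing 210.
STAGES_1715 = [
    "1805 ingress context mapping",    # determine and store the logical context
    "1810 ingress ACL",                # access control into the logical switch 220
    "1815 L2 forwarding",              # choose a logical egress port
    "1820 egress context mapping",     # logical port -> physical port of element 1720
    "1825 physical mapping",           # queues/ports used to reach that physical port
]
STAGES_1720 = [
    "1830 context lookup",             # re-read the stored logical context
    "1835 egress ACL",                 # access control out of the logical switch 220
]

def l2_processing_205(packet):
    packet.setdefault("stages_done", [])
    packet["stages_done"] += STAGES_1715   # performed by the edge element 1715
    packet["stages_done"] += STAGES_1720   # performed by the extender 1720
    return packet

print(l2_processing_205({})["stages_done"])
```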
The L3 processing 210 includes six stages 1840-1856 for processing a packet through the logical router 225 (not shown in FIG. 18) that is implemented in the managed switching element 1720 based on the L3 flow entries of the managed switching element 1720. As mentioned above, L3 processing involves performing a set of logical routing lookups for determining where to route the packet through a layer 3 network.
Thefirst stage1840 performs a logical ingress ACL lookup for determining access control when thelogical router225 receives the packet (i.e., when the managed switchingelement1720 which implements thelogical router225 receives the packet). Thenext stage1841 performs DNAT to revert the destination address of the packet back to the real address of the destination machine that is hidden from the source machine of the packet. Thisstage1841 is performed when DNAT is enabled.
The next stage 1845 performs a logical L3 routing for determining one or more logical ports to which to send the packet through the layer 3 network based on the L3 addresses (e.g., destination IP address) of the packet and routing tables (e.g., containing L3 entries). Since the logical router 225 is implemented by the managed switching element 1720, the L3 flow entries are configured in the managed switching element 1720.
At thefourth stage1850, the managed switchingelement1720 of some embodiments also performs SNAT on the packet. For instance, the managed switchingelement1720 replaces the source IP address of the packet with a different IP address in order to hide the source IP address when the source NAT is enabled. Also, as will be described further below, the managed switching element may use a NAT daemon to receive flow entries for translating network addresses. A NAT daemon will be described further below by reference toFIG. 31.
Thefifth stage1855 performs logical L3 egress ACL lookups for determining access control before thelogical router225 routes the packet out of thelogical router225 through the port determined in thestage1845. The L3 egress ACL lookups are performed based on the L3 addresses (e.g., source and destination IP addresses) of the packet.
The sixth stage 1856 performs address resolution in order to translate the destination L3 address (e.g., a destination IP address) into a destination L2 address (e.g., a destination MAC address). In some embodiments, the managed switching element 1720 uses a standard address resolution (e.g., by sending out ARP requests or looking up the ARP cache) to find the destination L2 address that corresponds to the destination IP address. Also, as will be described further below, the managed switching element 1720 of some embodiments may use an L3 daemon to receive flow entries for resolving L3 addresses into L2 addresses. An L3 daemon will be described further below by reference to FIGS. 48-50.
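A minimal sketch of the six L3 stages 1840-1856 follows, assuming a simple longest-prefix-match routing table; the addresses and table contents are hypothetical, and the optional DNAT/SNAT and ACL stages appear only as comments rather than working code.

```python
import ipaddress

# Hypothetical logical routing table for the stage 1845 lookup; port 2 of the
# logical router 225 uses MAC 01:01:01:01:01:02 as in the example of FIG. 24.
ROUTING_TABLE = [
    {"prefix": "1.1.1.0/24", "egress_port": 1, "port_mac": "01:01:01:01:01:01"},
    {"prefix": "1.1.2.0/24", "egress_port": 2, "port_mac": "01:01:01:01:01:02"},
]

def longest_prefix_match(table, dst_ip):
    """Return the most specific route whose prefix contains dst_ip."""
    candidates = [r for r in table
                  if ipaddress.ip_address(dst_ip) in ipaddress.ip_network(r["prefix"])]
    return max(candidates, key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen)

def l3_processing_210(packet):
    # 1840: logical L3 ingress ACL lookup (the packet would be dropped here on failure)
    # 1841: if DNAT is enabled, revert the destination IP to the real destination address
    route = longest_prefix_match(ROUTING_TABLE, packet["dst_ip"])        # 1845
    packet["src_mac"] = route["port_mac"]   # rewrite source MAC to the egress port's MAC
    # 1850: if SNAT is enabled, replace the source IP (possibly via the NAT daemon)
    # 1855: logical L3 egress ACL lookup on the source and destination IP addresses
    # 1856: resolve the destination IP to a destination MAC (ARP, ARP cache, or L3 daemon)
    return route["egress_port"]

print(l3_processing_210({"dst_ip": "1.1.2.10", "src_mac": None}))        # -> 2
```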
When the logical router 225 is not coupled to the destination logical network, the logical switch 220 sends the packet to another logical router towards the destination logical network. When the logical router 225 is coupled to the destination logical network, the logical switch 220 routes the packet to the destination logical network (i.e., the logical switch that forwards the packet for the destination logical network).
The L2 processing 215, in some embodiments, includes seven stages 1860-1890 for processing the packet through the logical switch 230 in another logical network (not shown in FIG. 18) that is implemented across the managed switching elements 1720 and 1725 (not shown). The stages 1860-1890 are similar to the stages 1805-1835, respectively, except that the stages 1860-1890 are performed by the logical switch 230 (i.e., by the managed switching elements 1720 and 1725 that implement the logical switch 230).
FIG. 19 conceptually illustrates an example network architecture 1900 of some embodiments which implements the logical router 225 and logical switches 220 and 230. Specifically, the network architecture 1900 represents a physical network that effectuates logical networks whose data packets are switched and/or routed by the logical router 225 and the logical switches 220 and 230. The figure illustrates in the top half of the figure the logical router 225 and the logical switches 220 and 230. This figure illustrates in the bottom half of the figure a second-level managed switching element 1910 and managed switching elements 1915 and 1920, which are running in hosts 1990, 1980, and 1985 (e.g., machines operated by operating systems such as Windows™ and Linux™), respectively. The figure illustrates VMs 1-4 in both the top and the bottom of the figure.
In this example, the logical switch 220 forwards data packets between the logical router 225, VM 1, and VM 2. The logical switch 230 forwards data packets between the logical router 225, VM 3, and VM 4. As mentioned above, the logical router 225 routes data packets between the logical switches 220 and 230 and possibly other logical routers and switches (not shown). The logical switches 220 and 230 and the logical router 225 are logically coupled through logical ports (not shown) and exchange packets through the logical ports. These logical ports are mapped to physical ports of the managed switching elements 1910, 1915 and 1920.
In some embodiments, each of the logical switches 220 and 230 is implemented across the managed switching elements 1915 and 1920 and possibly other managed switching elements (not shown). In some embodiments, the logical router 225 is implemented in the second-level managed switching element 1910 based on the flow entries of the managed switching element 1910.
In this example, the managed switchingelements1910,1915 and1920 are software switching elements running inhosts1990,1980 and1985, respectively. The managedswitching elements1910,1915 and1920 have flow entries which implement thelogical switches220 and230. Using these flow entries, the managed switchingelements1915 and1920 forward network data (e.g., packets) between network elements in the network that are coupled to the managed switchingelements1910,1915 and1920. For instance, the managed switchingelement1915 routes network data betweenVMs 1 and 3, and the second-level managedswitching element1910. Similarly, the managed switchingelement1920 routes network data betweenVMs 2 and 4, and the second-level managedswitching element1910. As shown, the managed switchingelements1915 and1920 each have three ports (depicted as numbered squares) through which to exchange data packets with the network elements that are coupled to the managed switchingelements1915 and1920.
The managed switching element 1910 is similar to the managed switching element 305 described above by reference to FIG. 4 in that the managed switching element 1910 is a second-level managed switching element that functions as an extender. The managed switching element 1910 also implements the logical router 225 based on the flow entries. Using these flow entries, the managed switching element 1910 routes packets at L3. In this example, the logical router 225 implemented in the managed switching element 1910 routes packets between the logical switch 220 that is implemented across the managed switching elements 1910 and 1915 and the logical switch 230 implemented across the managed switching elements 1910 and 1920.
In this example, the managed switchingelement1910 is coupled to the managed switchingelement1915, which runs in thehost1980, through a tunnel that terminates atport 2 of the managed switchingelement1915 as shown. Similarly, the managed switchingelement1910 is coupled to the managed switchingelement1920 through a tunnel that terminates atport 1 of the managed switchingelements1920.
In this example, each of thehosts1980 and1985 includes a managed switching element and several VMs as shown. The VMs 1-4 are virtual machines that are each assigned a set of network addresses (e.g., a MAC address for L2, an IP address for L3, etc.) and can send and receive network data to and from other network elements. The VMs are managed by hypervisors (not shown) running on thehosts1980 and1985.
Several example data exchanges through thenetwork architecture1900 will now be described. WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 2 that is also coupled to the samelogical switch220, the packet is first sent to the managed switchingelement1915. The managedswitching element1915 then performs theL2 processing205 on the packet because the managed switchingelement1915 is the edge switching element that receives the packet fromVM 1. The result of theL2 processing205 on this packet would indicate that the packet should be sent to the managed switchingelement1920 to get toVM 2 throughport 4 of the managed switchingelement1920. BecauseVMs 1 and 2 are in the same logical network and therefore L3 routing for the packet is not necessary, no L3 processing needs to be performed on this packet. The packet is then sent to the managed switchingelement1920 via the second-level managedswitching element1910 which is bridging between the managed switchingelement1915 and the managed switchingelement1920. The packet reachesVM 2 throughport 4 of the managed switchingelement1920.
WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 3 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement1915. The managedswitching element1915 performs a portion of L2 processing on the packet. However, because the packet is sent from one logical network to another (i.e., the logical L3 destination address of the packet is for another logical network), an L3 processing needs to be performed on this packet.
The managed switching element 1915 sends the packet to the second-level managed switching element 1910 so that the managed switching element 1910 performs the rest of the L2 processing and the L3 processing 210 on the packet. The managed switching element 1910 then performs a portion of another L2 processing and forwards the packet back to the managed switching element 1915. The managed switching element 1915 performs the L2 processing 215 on the packet received from the managed switching element 1910 and the result of this L2 processing would indicate that the packet should be sent to VM 3 through port 5 of the managed switching element 1915.
WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 4 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement1915. The managedswitching element1915 performs theL2 processing205 on the packet. However, because the packet is sent from one logical network to another, an L3 processing needs to be performed.
The managed switching element 1915 sends the packet to the managed switching element 1910 so that the managed switching element 1910 performs the rest of the L2 processing 205 and the L3 processing 210 on the packet. The result of the L3 processing 210 performed at the managed switching element 1910 would indicate that the packet should be sent to the managed switching element 1920. The managed switching element 1910 then performs a portion of L2 processing on the packet and the result of this L2 processing would indicate that the packet should be sent to VM 4 through the managed switching element 1920. The managed switching element 1920 performs the rest of the L2 processing to determine that the packet should be sent to VM 4 through port 5 of the managed switching element 1920.
FIG. 20 conceptually illustrates an example network architecture 2000 of some embodiments which implements the logical router 225 and logical switches 220 and 230. Specifically, the network architecture 2000 represents a physical network that effectuates logical networks whose data packets are switched and/or routed by the logical router 225 and the logical switches 220 and 230. The figure illustrates in the top half of the figure the logical router 225 and the logical switches 220 and 230. This figure illustrates in the bottom half of the figure the second-level managed switching element 1910 and managed switching elements 1915 and 1920, which are running in hosts 1990, 1980, and 1985, respectively. The figure illustrates VMs 1-4 in both the top and the bottom of the figure.
Thenetwork architecture2000 is similar to thenetwork architecture1900 except that thenetwork architecture2000 additionally includes the managed switchingelement2005 which runs in thehost2010. The managedswitching element2005 of some embodiments is a second-level managed switching element that functions as a pool node.
In some embodiments, tunnels are established by the network control system (not shown) to facilitate communication between the network elements. For instance, the managed switchingelement1915 in this example is coupled to the managed switchingelement2005, which runs in thehost2010, through a tunnel that terminates atport 1 of the managed switchingelement1915 as shown. Similarly, the managed switchingelement1920 is coupled to the managed switchingelement2005 through a tunnel that terminates atport 2 of the managed switchingelements1920. Also, the managed switchingelements2005 and1910 are coupled through a tunnel as shown.
Thelogical router225 and thelogical switches220 and230 are implemented in the managed switchingelements1910,1915, and1920 as described by reference toFIG. 19 above, except that the second-level managedswitching element2005 is involved in the data packet exchange. That is, the managed switchingelements1915 and1910 exchange packets through the managed switchingelement2005. The managedswitching elements1920 and1910 exchange packets through the managed switchingelement2005. The managedswitching elements1915 and1920 exchange packets through the managed switchingelement2005.
FIG. 21 conceptually illustrates anexample network architecture2100 of some embodiments which implements thelogical router225 andlogical switches220 and230. Thenetwork architecture2100 is similar to thenetwork architecture1900 except that there is a tunnel established between the managed switchingelement1910 and the managed switchingelement1920. This figure illustrates that thenetwork architecture2100 of some embodiments is a mixture of thenetwork architecture1900 and thenetwork architecture2000. That is, some managed edge switching elements have tunnels to a second-level managed switching element that is coupled to a centralized L3 router while other managed edge switching elements have to go through a second-level managed switching element that functions as a pool node in order to exchange packets with a second-level managed switching element that is coupled to the centralized L3 router.
FIG. 22 conceptually illustrates an example architecture of thehost1990 of some embodiments that includes the managed switchingelement1910 that implements a logical router based on flow entries. Thehost1990, in some embodiments, is a machine that is managed by an operating system (e.g., Linux) that is capable of creating virtual machines. As shown, thehost1990 in this example includes a managedswitching element1910, and aNIC2245. This figure also illustrates acontroller cluster2205.
Thecontroller cluster2205 is a set of network controllers or controller instances that manage the network elements, including the managed switchingelement1910. The managedswitching element1910 in this example is a software switching element implemented in thehost1990 that includes auser space2212 and akernel2210. The managedswitching element1910 includes acontrol daemon2215 running in theuser space2212, and acontroller patch2230 and abridge2235 running in thekernel2210. Also running in theuser space2212 is aNAT daemon2250, which will be described further below. Theuser space2212 and thekernel2210, in some embodiments, are of an operating system for thehost1990 while in other embodiments theuser space2212 and thekernel2210 are of a virtual machine that is running on thehost1990.
In some embodiments, thecontroller cluster2205 communicates with a control daemon2215 (e.g., by using OpenFlow protocol or some other communication protocol), which, in some embodiments, is an application running in the background of theuser space2212. Thecontrol daemon2215 communicates with thecontroller cluster2205 in order to process and route packets that the managed switchingelement1910 receives. Specifically, thecontrol daemon2215, in some embodiments, receives configuration information from thecontroller cluster2205 and configures thecontroller patch2230. For example, thecontrol daemon2215 receives commands from thecontroller cluster2205 regarding operations for processing and routing packets at L2 and L3 that the managed switchingelement1910 receives.
The controller patch 2230 is a module that runs in the kernel 2210. In some embodiments, the control daemon 2215 configures the controller patch 2230. When configured, the controller patch 2230 contains rules (e.g., flow entries) for processing, forwarding, and routing the packets that the managed switching element 1910 receives. The controller patch 2230 implements both logical switches and a logical router.
In some embodiments, the controller patch 2230 uses the NAT daemon for network address translation. As will be described further below, the NAT daemon 2250 generates flow entries regarding network address translation and sends back the flow entries to the managed switching element 1910 to use. A NAT daemon will be described further below.
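The division of labor between the controller patch and the NAT daemon might look roughly like the following; this is a hypothetical sketch and does not reflect the actual daemon interface, which is described later by reference to FIG. 31.

```python
# Hypothetical sketch: on a miss, the controller patch asks the NAT daemon for a
# translation flow entry, caches it, and applies it to subsequent packets itself.
class NatDaemon:
    def __init__(self, translated_ip):
        self.translated_ip = translated_ip

    def flow_entry_for(self, src_ip):
        return {"match": {"src_ip": src_ip},
                "action": ("set_src_ip", self.translated_ip)}

class ControllerPatch:
    def __init__(self, nat_daemon):
        self.nat_daemon = nat_daemon
        self.nat_flows = {}                        # cached NAT flow entries

    def snat(self, packet):
        entry = self.nat_flows.get(packet["src_ip"])
        if entry is None:                          # no entry yet: ask the NAT daemon
            entry = self.nat_daemon.flow_entry_for(packet["src_ip"])
            self.nat_flows[packet["src_ip"]] = entry
        packet["src_ip"] = entry["action"][1]      # apply the translation
        return packet

patch = ControllerPatch(NatDaemon("10.0.0.1"))
print(patch.snat({"src_ip": "1.1.1.10"}))          # {'src_ip': '10.0.0.1'}
```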
The controller patch 2230 receives packets from a network stack 2250 of the kernel 2210 or from the bridge 2235. The bridge 2235 routes network data between the network stack 2250 and network hosts external to the host (i.e., network data received through the NIC 2245). As shown, the bridge 2235 routes network data between the network stack 2250 and the NIC 2245. The bridge 2235 of some embodiments performs standard L2 packet learning and routing.
The network stack 2250 can receive packets from network hosts external to the managed switching element 1910 through the NIC 2245. The network stack 2250 then sends the packets to the controller patch 2230. In some cases, the packets are received from network hosts external to the managed switching element through tunnels. In some embodiments, the tunnels terminate at the network stack 2250. Thus, when the network stack 2250 receives a packet through a tunnel, the network stack 2250 unwraps the tunnel header (i.e., decapsulates the payload) and sends the unwrapped packet to the controller patch 2230.
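The receive path just described, from the NIC through the bridge and network stack to the controller patch, can be sketched as follows; the function names and frame format are hypothetical.

```python
def controller_patch_process(packet):
    """Stand-in for the flow-entry based L2/L3 processing of the controller patch 2230."""
    print("controller patch 2230 processing", packet)

def network_stack_receive(frame):
    """Terminate the tunnel at the network stack 2250: strip the tunnel header
    (decapsulate the payload) and hand the inner packet to the controller patch."""
    controller_patch_process(frame["payload"])

def bridge_receive(frame):
    """Dispatch a frame arriving from the NIC 2245 at the bridge 2235."""
    if "tunnel_header" in frame:          # traffic arriving over an established tunnel
        network_stack_receive(frame)
    else:
        controller_patch_process(frame)   # non-tunneled traffic handed up directly

bridge_receive({"tunnel_header": "GRE", "payload": {"dst_mac": "MAC_VM4"}})
```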
An example operation of the managed switchingelement1910 will now be described. In this example, tunnels are established between the managed switchingelement1910 and the managed switchingelements1915 and1920 (not shown inFIG. 22) that are external to thehost1990. That is, the managed switchingelements1910,1915, and1920 are connected through the tunnels as illustrated inFIG. 19. The tunnels terminate at thenetwork stack2250.
The managedswitching element1915 sends a packet, sent by VM1 toVM 4, to the managed switchingelement1910. The packet is received by theNIC2245 and then is sent to thebridge2235. Based on the information in the packet header, thebridge2235 determines that the packet is sent over the established tunnel and sends the packet to thenetwork stack2250. Thenetwork stack2250 unwraps the tunnel header and sends the unwrapped packet to thecontroller patch2230.
According to the flow entries that thecontroller patch2230 has, thecontroller patch2230 performs L3 processing to route the packet because the packet is sent from one logical network to another logical network. By performing the L3 processing and some L2 processing, the managed switchingelement1910 determines that the packet should be sent to the managed switchingelement1920 because the destination network layer address should go to a logical switch that belongs to the destination logical network. Thecontroller patch2230 sends the packet through thenetwork stack2250, thebridge2235, and theNIC2245 over the tunnel to the managed switchingelement1920 that implements the logical switch that belongs to the destination logical network.
FIG. 23 conceptually illustrates an example implementation of logical switches and logical routers in managed switching elements. Specifically, this figure illustrates implementation of thelogical router225 and thelogical switches220 and230 in the second-level managedswitching element1910 and the managed switchingelements1915 and1920. The figure illustrates in the top half of the figure thelogical router225 and thelogical switches220 and230. This figure illustrates in the bottom half of the figure the managed switching elements1910-1920. The figure illustrates VMs 1-4 in both the top and the bottom halves of the figure.
Thelogical switches220 and230 and thelogical router225 are logically coupled through logical ports. This particular configuration of thelogical switches220 and230 is the same as the configuration illustrated in an example described above by reference toFIG. 12.
In the example ofFIG. 23, the controller cluster2205 (not shown inFIG. 23) configures the managed switchingelement1910 by supplying flow entries to the managed switchingelement1910 such that the managed switching element implements thelogical router225 based on the flow entries.
FIG. 24 conceptually illustrates an example operation of thelogical switches220 and230, thelogical router225, and the managed switchingelements1910,1915 and1920 described above by reference toFIG. 23. Specifically,FIG. 24 illustrates an operation of the managed switchingelement1910, which implements thelogical router225. Portions of the logical processing pipeline that the managed switchingelements1915 and1920 perform are not depicted inFIG. 24 for simplicity. These portions of the logical processing pipeline are similar to the portions of logical processing performed by the managed switchingelements815 and820 in the example illustrated in the top half ofFIG. 13A andFIG. 13C. That is, for illustrating the example ofFIG. 24,FIG. 24 replaces the bottom half ofFIG. 13A andFIG. 13B.
As shown in the bottom half ofFIG. 24, the managed switchingelement1910 includesL2 entries2405 and2415 andL3 entries2410. These entries are flow entries that the controller cluster2205 (not shown) supplies to the managed switchingelement1910. Although these entries are depicted as three separate tables, the tables do not necessarily have to be separate tables. That is, a single table may include all these flow entries.
When the managed switchingelement1910 receives apacket2430 from the managed switchingelement1915 that is sent fromVM 1 towardsVM 4, the managed switchingelement1910 begins processing thepacket2430 based on theflow entries2405 of the managed switchingelement1910. The managedswitching element1910 identifies a record indicated by an encircled 1 (referred to as “record 1”) in the forwarding tables that implements the context mapping of thestage1830. Therecord 1 identifies thepacket2430's logical context based on the logical context that is stored in thepacket2430's header. The logical context specifies that thepacket2430 has been processed by the portion of logical processing (i.e., L2 ingress ACL, L2 forwarding) performed by the managed switchingelement1915. As such, therecord 1 specifies that thepacket2430 be further processed by the forwarding tables (e.g., by sending thepacket2430 to a dispatch port).
Next, the managed switching element 1910 identifies, based on the logical context and/or other fields stored in the packet 2430's header, a record indicated by an encircled 2 (referred to as "record 2") in the forwarding tables that implements the egress ACL of the stage 1835. In this example, the record 2 allows the packet 2430 to be further processed (e.g., the packet 2430 can get out of the logical switch 220 through port "X" of the logical switch 220) and, thus, specifies the packet 2430 be further processed by the flow entries of the managed switching element 1910 (e.g., by sending the packet 2430 to a dispatch port). In addition, the record 2 specifies that the managed switching element 1910 store the logical context (i.e., the packet 2430 has been processed by the stage 2452 of the processing pipeline 2400) of the packet 2430 in the set of fields of the packet 2430's header. (It is to be noted that all records specify that a managed switching element performing logical processing update the logical context stored in the set of fields whenever a managed switching element performs some portion of logical processing based on a record.)
The managedswitching element1910 continues processing thepacket2430 based on the flow entries. The managedswitching element1910 identifies, based on the logical context and/or other fields stored in thepacket2430's header, a record indicated by an encircled 3 (referred to as “record 3”) in theL3 entries2410 that implements L3 ingress ACL by specifying that the managed switchingelement1910 should accept the packet through thelogical port 1 of thelogical router225 based on the information in the header of thepacket2430.
The managedswitching element1910 then identifies a flow entry indicated by an encircled 4 (referred to as “record 4”) in theL3 entries2410 that implementsL3 routing1845 by specifying that thepacket2430 with its destination IP address (e.g., 1.1.2.10) should be allowed to exit out ofport 2 of thelogical router225. Also, the record 4 (or another record in the routing table, not shown) indicates that the source MAC address for thepacket2430 is to be rewritten to the MAC address ofport 2 of the logical router225 (i.e., 01:01:01:01:01:02). The managedswitching element1910 then identifies a flow entry indicated by an encircled 5 (referred to as “record 5”) in theL3 entries2410 that implements L3 egress ACL by specifying that the managed switchingelement1910 can send the packet out throughport 2 of thelogical router225 based on the information (e.g., source IP address) in the header of thepacket2430.
Based on the logical context and/or other fields stored in thepacket2430's header, the managed switchingelement1910 identifies a record indicated by an encircled 6 (referred to as “record 6”) in theL2 entries2415 that implements the ingress ACL of thestage1860. In this example, therecord 6 allows thepacket2430 to be further processed and, thus, specifies thepacket2430 be further processed by the managed switching element1910 (e.g., by sending thepacket2430 to a dispatch port). In addition, therecord 6 specifies that the managed switchingelement1910 store the logical context (i.e., thepacket2430 has been processed by the stage2462 of the processing pipeline2400) of thepacket2430 in the set of fields of thepacket2430's header.
Next, the managed switchingelement1910 identifies, based on the logical context and/or other fields stored in thepacket2430's header, a record indicated by an encircled 7 (referred to as “record 7”) in the forwarding tables that implements the logical L2 forwarding of thestage1865. The record 7 specifies that a packet with the MAC address ofVM 4 as destination MAC address should be forwarded through a logical port (not shown) of thelogical switch230 that is connected toVM 4.
The record 7 also specifies that thepacket2430 be further processed by the forwarding tables (e.g., by sending thepacket2430 to a dispatch port). Also, the record 7 specifies that the managed switchingelement1910 store the logical context (i.e., thepacket2430 has been processed by thestage1865 of the processing pipeline2400) in the set of fields of thepacket2430's header.
Based on the logical context and/or other fields stored in thepacket2430's header, the managed switchingelement1910 identifies a record indicated by an encircled 8 (referred to as “record 8”) in the forwarding tables that implements the context mapping of thestage1870. In this example, therecord 8 identifiesport 5 of the managed switchingelement1920 to whichVM 4 is coupled as the port that corresponds to the logical port (determined at stage1865) of thelogical switch230 to which thepacket2430 is to be forwarded. Therecord 8 additionally specifies that thepacket2430 be further processed by the forwarding tables (e.g., by sending thepacket2430 to a dispatch port).
Based on the logical context and/or other fields stored in thepacket2430's header, the managed switchingelement1910 then identifies a record indicated by an encircled 9 (referred to as “record 9”) in theL2 entries2415 that implements the physical mapping of thestage1875. Therecord 9 specifies a port (not shown) of the managed switchingelement1910 through which thepacket2430 is to be sent in order for thepacket2430 to reach the managed switchingelement1920. In this case, the managed switchingelement1910 is to send thepacket2430 out of that port of managed switchingelement1910 that is coupled to the managed switchingelement1920.
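For reference, the nine records consulted in this walkthrough can be condensed into a table; the sketch below only restates the sequence described above, and the grouping into tables is simplified.

```python
# Condensed, hypothetical summary of the records matched by the managed
# switching element 1910 for the VM 1 -> VM 4 packet 2430, in order.
RECORDS_1910 = [
    ("record 1", "context mapping (stage 1830): read the stored logical context"),
    ("record 2", "L2 egress ACL (stage 1835): allow the packet out of the logical switch 220"),
    ("record 3", "L3 ingress ACL: accept the packet on logical port 1 of the logical router 225"),
    ("record 4", "L3 routing (stage 1845): dst IP 1.1.2.10 -> router port 2, rewrite source MAC"),
    ("record 5", "L3 egress ACL: allow the packet out through port 2 of the logical router 225"),
    ("record 6", "L2 ingress ACL (stage 1860) for the logical switch 230"),
    ("record 7", "L2 forwarding (stage 1865): dst MAC of VM 4 -> logical port toward VM 4"),
    ("record 8", "context mapping (stage 1870): logical port -> port 5 of switching element 1920"),
    ("record 9", "physical mapping (stage 1875): send out the port coupled to element 1920"),
]

for name, action in RECORDS_1910:
    print(f"{name}: {action}")
```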
FIGS. 25-30B illustrate a distributed logical router implemented in several managed switching elements based on flow entries of the managed switching element. In particular,FIGS. 25-30B illustrate that the entire logical processing pipeline that includes source L2 processing, L3 routing, and destination L2 processing is performed by a first hop managed switching element (i.e., the switching element that receives a packet directly from a machine).
FIG. 25 conceptually illustrates an example implementation of thelogical processing pipeline200 described above by reference toFIG. 2. In particular,FIG. 25 illustrates that theL3 processing210 can be performed by any managed switching elements that directly receives a packet from a source machine.FIG. 25 illustrates anetwork architecture2500. In thenetwork architecture2500, thelogical processing pipeline200 is performed by a managedswitching element2505. In this example, theL3 processing210 is performed by the managed switchingelement2505 based on flow entries of the managed switchingelement2505.FIG. 25 also illustratessource machine2515 anddestination machine2520.
The managed switching element 2505 is an edge switching element that directly receives the packets from a machine coupled to the edge switching element. The managed switching element 2505 receives packets from the source machine 2515. When the managed switching element 2505 receives a packet from the source machine 2515, the managed switching element 2505, in some embodiments, performs the entire logical processing pipeline 200 on the packet in order to logically forward and route the packet.
When a received packet is headed to thedestination machine2520, which is in another logical network in this example, the managed switchingelement2505 functions as a logical switch that is in the logical network to which thesource machine2515 belongs; a logical switch that is in the logical network to which thedestination machine2520 belongs; and a logical router that routes packets between these two logical switches. Based on the result of performinglogical processing pipeline200, the managed switchingelement2505 forwards the packet to the managed switchingelement2510 through which thedestination machine2520 receives the packet.
FIG. 26 conceptually illustrates the logical processing pipeline 200 of some embodiments for processing a packet through the logical switch 220, the logical router 225, and the logical switch 230. Specifically, this figure illustrates the logical processing pipeline 200 when performed in the network architecture 2500 described above by reference to FIG. 25. As described above, in the network architecture 2500, the L2 processing 205, the L3 processing 210, and the L2 processing 215 are performed by the single managed switching element 2505, which is an edge switching element that receives packets from a machine. Hence, in these embodiments, the first-hop switching element performs the entire logical processing pipeline 200.
TheL2 processing205, in some embodiments, includes four stages2605-2620 for processing a packet through the logical switch220 (not shown inFIG. 26). In thestage2605, ingress context mapping is performed on the packet to determine the logical context of the packet. In some embodiments, thestage2605 is performed when thelogical switch220 receives the packet (e.g., the packet is initially received by the managed switching element2505).
In some embodiments, thesecond stage2610 is defined for thelogical switch220. In some such embodiments, thestage2610 operates on the packet's logical context to determine ingress access control of the packet with respect to the logical switch. For example, an ingress ACL is applied to the packet to control the packet's access to the logical switch when the logical switch receives the packet. Based on the ingress ACL defined for the logical switch, the packet may be further processed (e.g., by the stage2615) or the packet may be dropped, for example.
In thethird stage2615 of theL2 processing205, an L2 forwarding is performed on the packet in the context of the logical switch. In some embodiments, thethird stage2615 operates on the packet's logical context to process and forward the packet with respect to thelogical switch220. For instance, some embodiments define an L2 forwarding table or L2 forwarding entries for processing the packet atlayer 2.
Thefourth stage2620 of some embodiments is defined for thelogical switch220. Thefourth stage2620 of some such embodiments operates on the packet's logical context to determine egress access control of the packet with respect to the logical switch. For instance, an egress ACL may be applied to the packet to control the packet's access out of thelogical switch220 after logical forwarding has been performed on the packet. Based on the egress ACL defined for the logical switch, the packet may be further processed (e.g., sent out of a logical port of the logical switch or sent to a dispatch port for further processing) or the packet may be dropped, for example.
When the packet's destination is in another logical network (i.e., when the packet's destination logical network is different than the logical network whose traffic is processed by the logical switch220), thelogical switch220 sends the packet to thelogical router225, which then performs the L3 processing atstage210 in order to route the packet to the destination logical network. TheL3 processing210 includes six stages2635-2651 for processing a packet through the logical router225 (not shown inFIG. 26) that is implemented by the managed switching element2505 (not shown inFIG. 26). As mentioned above, L3 processing involves performing a set of logical routing lookups for determining where to route the packet through alayer 3 network.
Thefirst stage2635 performs a logical ingress ACL lookup for determining access control when thelogical router225 receives the packet (i.e., when the managed switchingelement2505 which implements thelogical router225 receives the packet). In some embodiments, thestage2635 operates on the packet's logical context to determine ingress access control of the packet with respect to thelogical router225. Thenext stage2636 performs DNAT to revert the destination address of the packet back to the real address of the destination machine that is hidden from the source machine of the packet. Thisstage2636 is performed when DNAT is enabled.
Thenext stage2640 performs a logical L3 routing for determining one or more logical ports to send the packet through thelayer 3 network based on the L3 addresses (e.g., destination IP address) of the packet, forwarding tables containing L3 flow entries, and the packet's logical context.
Thefourth stage2645 of some embodiments performs SNAT on the packet. For instance, the managed switchingelement2505 replaces the source IP address of the packet with a different IP address in order to hide the source IP address when the SNAT is enabled. Also, as will be described further below, the managed switching element may use a NAT daemon to receive flow entries for translating network addresses. A NAT daemon will be described further below by reference toFIG. 31.
Thefifth stage2650 performs logical egress ACL lookups for determining access control before thelogical router225 routes the packet out of thelogical router225 through the port determined in thestage2640. The egress ACL lookups are performed based on the L3 addresses (e.g., source and destination IP addresses) of the packet. In some embodiments, thestage2650 operates on the packet's logical context to determine egress access control of the packet with respect to thelogical router225.
Thesixth stage2651 performs address resolution in order to translate the destination L3 address (e.g., a destination IP address) into a destination L2 address (e.g., a destination MAC address). In some embodiments, the managed switchingelement2505 uses a standard address resolution (e.g., by sending out ARP requests or looking up ARP cache) to find the destination L2 address that corresponds to the destination IP address. Also, as will be described further below, the managed switchingelement2505 of some embodiments may use an L3 daemon to receive flow entries for resolving L3 addresses into L2 addresses. An L3 daemon will be described further below by reference toFIGS. 48-50.
When the logical router 225 is not coupled to the destination logical network, the logical switch 220 sends the packet to another logical router towards the destination logical network. A portion of the logical processing that corresponds to the operation of the other logical router would also be implemented in the managed switching element 2505. When the logical router 225 is coupled to the destination logical network, the logical switch 220 routes the packet to the destination logical network (i.e., the logical switch that forwards the packet for the destination logical network).
The L2 processing 215, in some embodiments, includes five stages 2660-2680 for processing the packet through the logical switch 230 (not shown in FIG. 26). In some embodiments, the first stage 2660 is defined for the logical switch 230. In some such embodiments, the stage 2660 operates on the packet's logical context to determine ingress access control of the packet with respect to the logical switch 230. For example, an ingress ACL is applied to the packet to control the packet's access to the logical switch 230 when the logical switch 230 receives the packet from the logical router 225. Based on the ingress ACL defined for the logical switch, the packet may be further processed (e.g., by the stage 2665) or the packet may be dropped, for example.
In the second stage 2665 of the L2 processing pipeline 215, an L2 forwarding is performed on the packet in the context of the logical switch. In some embodiments, the second stage 2665 operates on the packet's logical context to process and forward the packet with respect to the logical switch 230. For instance, some embodiments define an L2 forwarding table or L2 forwarding entries for processing the packet at layer 2.
The third stage 2670 of some embodiments is defined for the logical switch 230. The third stage 2670 of some such embodiments operates on the packet's logical context to determine egress access control of the packet with respect to the logical switch. For instance, an egress ACL may be applied to the packet to control the packet's access out of the logical switch 230 after logical forwarding has been performed on the packet. Based on the egress ACL defined for the logical switch, the packet may be further processed (e.g., sent out of a logical port of the logical switch or sent to a dispatch port for further processing) or the packet may be dropped, for example.
In thefourth stage2675, egress context mapping is performed to identify a physical result that corresponds to the result of the logical forwarding of the packet. For example, the logical processing of the packet may specify that the packet is to be sent out of one or more logical ports (e.g., a logical egress port) of thelogical switch230. As such, the egress context mapping operation identifies a physical port(s) of one or more of the managed switching elements (including the managed switching element2505) that corresponds to the particular logical port of the logical switch.
Thefifth stage2680 of theL2 processing215 performs a physical mapping based on the egress context mapping performed at thefourth stage2675. In some embodiments, the physical mapping determines operations for forwarding the packet to the physical port that was determined in thefourth stage2675. For example, the physical mapping of some embodiments determines one or more queues (not shown) associated with one or more ports of the set of ports (not shown) of the managed switchingelement2505 through which to send the packet in order for the packet to reach the physical port(s) determined in thefourth stage2675. This way, the managed switching elements can route the packet along the correct path in the network for the packet to reach the determined physical port(s). Also, some embodiments remove the logical context after thefifth stage2680 is completed in order to return the packet to its original state before the logical processing pipeline2600 was performed on the packet.
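Taken together, the first-hop pipeline of FIG. 26 can be sketched as one ordered list of stages performed entirely by the managed switching element 2505; the stage labels follow the text above and the helper function is a hypothetical illustration.

```python
L2_205 = ["2605 ingress context mapping", "2610 ingress ACL",
          "2615 L2 forwarding", "2620 egress ACL"]
L3_210 = ["2635 L3 ingress ACL", "2636 DNAT (if enabled)", "2640 L3 routing",
          "2645 SNAT (if enabled)", "2650 L3 egress ACL", "2651 address resolution"]
L2_215 = ["2660 ingress ACL", "2665 L2 forwarding", "2670 egress ACL",
          "2675 egress context mapping", "2680 physical mapping"]

def first_hop_pipeline_200(packet):
    """Entire logical processing pipeline 200 performed at the first hop."""
    packet["stages_done"] = L2_205 + L3_210 + L2_215
    packet.pop("logical_context", None)   # some embodiments strip the logical context
                                          # after stage 2680, restoring the original packet
    return packet

print(len(first_hop_pipeline_200({"logical_context": {}})["stages_done"]))   # 15 stages
```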
FIG. 27 conceptually illustrates an example network architecture 2700 of some embodiments which implements the logical router 225 and logical switches 220 and 230. Specifically, the network architecture 2700 represents a physical network that effectuates logical networks whose data packets are switched and/or routed by the logical router 225 and the logical switches 220 and 230. The figure illustrates in the top half of the figure the logical router 225 and the logical switches 220 and 230. This figure illustrates, in the bottom half of the figure, the managed switching elements 2505 and 2510. The figure illustrates VMs 1-4 in both the top and the bottom of the figure.
In this example, thelogical switch220 forwards data packets between thelogical router225,VM 1, andVM 2. Thelogical switch230 forwards data packets between thelogical router225,VM 3, andVM 4. As mentioned above, thelogical router225 routes data packets between thelogical switches220 and230 and other logical routers and switches (not shown). Thelogical switches220 and230 and thelogical router225 are logically coupled through logical ports (not shown) and exchange data packets through the logical ports. These logical ports are mapped or attached to physical ports of the managed switchingelements2505 and2510.
In some embodiments, a logical router is implemented in each managed switching element in the managed network. When the managed switching element receives a packet from a machine that is coupled to the managed switching element, the managed switching element performs the logical routing. In other words, a managed switching element of these embodiments that is a first-hop switching element with respect to a packet performs theL3 processing210.
In this example, the managed switchingelements2505 and2510 are software switching elements running inhosts2525 and2530, respectively. The managedswitching elements2505 and2510 have flow entries which implement thelogical switches220 and230 to forward and route the packets that the managed switchingelement2505 and2510 receive from VMs 1-4. The flow entries also implement thelogical router225. Using these flow entries, the managed switchingelements2505 and2510 can forward and route packets between network elements in the network that are coupled to the managed switchingelements2505 and2510. As shown, the managed switchingelements2505 and2510 each have three ports (e.g., VIFs) through which to exchange data packets with the network elements that are coupled to the managed switchingelements2505 and2510. In some cases, the data packets in these embodiments will travel through a tunnel that is established between the managed switchingelements2505 and2510 (e.g., the tunnel that terminates atport 3 of the managed switchingelement2505 andport 3 of the managed switching element2510).
In this example, each of the hosts 2525 and 2530 includes a managed switching element and several VMs as shown. The VMs 1-4 are virtual machines that are each assigned a set of network addresses (e.g., a MAC address for L2, an IP address for L3, etc.) and can send and receive network data to and from other network elements. The VMs are managed by hypervisors (not shown) running on the hosts 2525 and 2530.
Several example data exchanges through thenetwork architecture2700 will now be described. WhenVM 1, that is coupled to thelogical switch220, sends a packet toVM 2 that is also coupled to the samelogical switch220, the packet is first sent to the managed switchingelement2505. The managedswitching element2505 then performs theL2 processing205 on the packet. The result of L2 processing would indicate that the packet should be sent to the managed switchingelement2510 over the tunnel established between the managed switchingelements2505 and2510 and get toVM 2 throughport 4 of the managed switchingelement2510. BecauseVMs 1 and 2 are in the same logical network, the managed switchingelement2505 does not perform theL3 processing210 and theL2 processing215.
WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 3 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement2505. The managedswitching element2505 performs theL2 processing205 on the packet. However, because the packet is sent from one logical network to another (i.e., the logical L3 destination address of the packet is for another logical network), theL3 processing210 needs to be performed. The managedswitching element2505 also performs theL2 processing215. That is, the managed switchingelement2505 as the first-hop switching element that receives the packet performs the entirelogical processing pipeline200 on the packet. The result of performing thelogical processing pipeline200 would indicate that the packet should be sent toVM 3 throughport 5 of the managed switchingelement2505. Thus, the packet did not have to go to another managed switching element although the packet did go through two logical switches and a logical router.
WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 4 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement2505. The managedswitching element2505, as the first-hop switching element for the packet, performs the entirelogical processing pipeline200 on the packet. The result of performing thelogical processing pipeline200 on this packet would indicate that the packet should be sent to the managed switchingelement2510 over the tunnel established between the managed switchingelements2505 and2510 and get toVM 4 throughport 5 of the managed switchingelement2510.
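To summarize the three exchanges, the first-hop decisions of the managed switching element 2505 can be tabulated as follows; the port numbers repeat the text above and the rest is a hypothetical illustration.

```python
# destination of a packet sent by VM 1 -> (is L3 processing needed?, resulting path)
FIRST_HOP_RESULTS_2505 = {
    "VM 2": (False, "tunnel to switching element 2510, delivered via its port 4"),
    "VM 3": (True,  "delivered via port 5 of switching element 2505 (stays on the host)"),
    "VM 4": (True,  "tunnel to switching element 2510, delivered via its port 5"),
}

def forward_from_vm1(destination):
    needs_l3, path = FIRST_HOP_RESULTS_2505[destination]
    pipeline = ("entire pipeline 200 (L2 205 + L3 210 + L2 215)"
                if needs_l3 else "L2 processing 205 only")
    return f"{pipeline}; {path}"

print(forward_from_vm1("VM 3"))
```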
FIG. 28 conceptually illustrates an example network architecture 2800 of some embodiments which implements the logical router 225 and logical switches 220 and 230. Specifically, the network architecture 2800 represents a physical network that effectuates logical networks whose data packets are switched and/or routed by the logical router 225 and the logical switches 220 and 230. The figure illustrates in the top half of the figure the logical router 225 and the logical switches 220 and 230. This figure illustrates in the bottom half of the figure the managed switching elements 2505 and 2510. The figure illustrates VMs 1-4 in both the top and the bottom of the figure.
Thenetwork architecture2800 is similar to thenetwork architecture2700 except that thenetwork architecture2800 additionally includes the managed switchingelement2805. The managedswitching element2805 of some embodiments is a second-level managed switching element that functions as a pool node.
In some embodiments, tunnels are established by the network control system (not shown) to facilitate communication between the network elements. For instance, the managed switchingelement2505 in this example is coupled to the managed switchingelement2805, which runs in thehost2810, through a tunnel that terminates atport 1 of the managed switchingelement2505 as shown. Similarly, the managed switchingelement2510 is coupled to the managed switchingelement2805 through a tunnel that terminates atport 2 of the managed switchingelements2510. In contrast to theexample architecture2700 illustrated inFIG. 27 above, no tunnel is established between the managed switchingelements2505 and2510.
Thelogical router225 and thelogical switches220 and230 are implemented in the managed switchingelement2505 and the second-level managedswitching element2805 is involved in the data packet exchange. That is, the managed switchingelements2505 and2510 exchange packets through the managed switchingelement2805.
FIG. 29 conceptually illustrates an example of a first-hop switching element that performs all of the L2 and L3 processing on a received packet in order to forward and route the packet. FIG. 29 illustrates implementation of the logical router 225 and the logical switches 220 and 230 by the managed switching elements 2505 and 2510. As shown, the entire logical processing pipeline 200 is performed by the managed switching element 2505 when the managed switching element 2505 is a first-hop switching element. The figure illustrates in the left half of the figure the logical router 225 and the logical switches 220 and 230. This figure illustrates in the right half of the figure the managed switching elements 2505 and 2510. The figure illustrates VMs 1-4 in both the right and the left halves of the figure.
WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 2 that is also coupled to the samelogical switch220, the packet is first sent to the managed switchingelement2505 throughport 4 of the managed switchingelement2505 because alogical port 1 of thelogical switch220 through which the packet goes into thelogical switch220 is attached or mapped toport 4 of the managed switchingelement2505.
The managedswitching element2505 then performs theL2 processing205 on the packet. Specifically, the managed switchingelement2505 first performs a logical context look up to determine the logical context of the packet based on the information included in the header fields of the packet. In this example, the source MAC address of the packet is a MAC address ofVM 1 and the source IP address of the packet is an IP address ofVM 1. The destination MAC address of the packet is a MAC address ofVM 2 and destination IP address of the packet is an IP address ofVM 2. In this example, the logical context specifies thatlogical switch220 is the logical switch that is to forward the packet and thatlogical port 1 of thelogical switch220 is the port through which the packet was received. The logical context also specifies thatport 2 of thelogical switch220 is the port through which to send the packet out toVM 2 becauseport 2 is associated with the MAC address ofVM 2.
The managedswitching element2505 then performs logical forwarding lookups based on the determined logical context of the packet. The managedswitching element2505 determines access control for the packet. For instance, the managed switchingelement2505 determines that the packet does not have network addresses (e.g., source/destination MAC/IP addresses, etc.) that will cause thelogical switch220 to reject the packet that came throughport 1 of thelogical switch220. The managedswitching element2505 also identifies from the logical context thatport 2 of thelogical switch220 is the port to send out the packet. Furthermore, the managed switchingelement2505 determines access control for the packet with respect toport 2 of thelogical switch220. For instance, the managed switchingelement2505 determines that the packet does not have network addresses that will cause thelogical switch220 not to send the packet through theport 2 of thelogical switch220.
The managedswitching element2505 then performs a mapping lookup to determine a physical port to which thelogical port 2 of thelogical switch220 is mapped. In this example, the managed switchingelement2505 determines thatlogical port 2 of thelogical switch220 is mapped toport 4 of the managed switchingelement2510. The managedswitching element2505 then performs a physical lookup to determine operations for forwarding the packet to the physical port. In this example, the managed switchingelement2505 determines that the packet should be sent to the managed switchingelement2510 over the tunnel established between the managed switchingelements2505 and2510 and get toVM 2 throughport 4 of the managed switchingelement2510. BecauseVMs 1 and 2 are in the same logical network, the managed switchingelement2505 does not perform an L3 processing. The managedswitching element2510 does not perform any logical processing on the packet but just forwards the packet toVM 2 throughport 4 of the managed switchingelement2510.
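The chain of lookups described above (context mapping, ingress ACL, logical L2 forwarding, egress ACL, and mapping to a physical port) can be sketched in code. The following Python sketch is provided for illustration only; the class, table, port, and MAC names are hypothetical stand-ins and do not represent the actual flow entry format used by the managed switching elements.

```python
# Illustrative sketch of the first-hop L2 processing 205 described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    src_mac: str
    dst_mac: str
    in_phys_port: int
    context: Optional[dict] = None      # logical context carried in the header

PHYS_TO_LOGICAL = {4: ("LS220", 1)}                 # physical port 4 -> logical port 1 of switch 220
MAC_TO_LOGICAL_PORT = {"mac:vm2": 2}                # VM 2's MAC -> logical port 2
LOGICAL_TO_PHYS = {("LS220", 2): ("MSE2510", 4)}    # logical port 2 -> port 4 of element 2510

def l2_pipeline(pkt: Packet):
    # Context mapping: derive the logical context from the physical ingress port.
    switch, in_lport = PHYS_TO_LOGICAL[pkt.in_phys_port]
    pkt.context = {"switch": switch, "in_lport": in_lport}
    # Ingress ACL: accept unless a rule rejects this source (evaluation elided).
    # Logical L2 forwarding: pick the egress logical port by destination MAC.
    out_lport = MAC_TO_LOGICAL_PORT[pkt.dst_mac]
    pkt.context["out_lport"] = out_lport
    # Egress ACL (elided), then map the logical egress port to a physical port.
    return LOGICAL_TO_PHYS[(switch, out_lport)]

print(l2_pipeline(Packet("mac:vm1", "mac:vm2", in_phys_port=4)))   # ('MSE2510', 4)
```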
WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 3 that is coupled to the logical switch230 (i.e., whenVMs 1 and 3 are in different logical networks), the packet is first sent to the managed switchingelement2505 throughport 4 of the managed switchingelement2505. The managedswitching element2505 performs theL2 processing205 on the packet. Specifically, the managed switchingelement2505 first performs a logical context look up to determine the logical context of the packet based on the information included in the header fields of the packet. In this example, the source MAC address of the packet is a MAC address ofVM 1 and the source IP address of the packet is an IP address ofVM 1. Because the packet is sent fromVM 1 toVM 3 that is in a different logical network, the packet has a MAC address associated with port X as the destination MAC address (i.e., 01:01:01:01:01:01 in this example). The destination IP address of the packet is an IP address of VM 3 (e.g., 1.1.2.10). In this example, the logical context specifies thatlogical switch220 is the logical switch that is to forward the packet and thatlogical port 1 of thelogical switch220 is the port through which the packet was received. The logical context also specifies that port X of thelogical switch220 is the port through which to send the packet out to thelogical router225 because port X is associated with the MAC address ofport 1 of thelogical router225.
The managedswitching element2505 then determines access control for the packet. For instance, the managed switchingelement2505 determines that the packet does not have network addresses (e.g., source/destination MAC/IP addresses, etc.) that will cause thelogical switch220 to reject the packet that came throughport 1 of thelogical switch220. The managedswitching element2505 also identifies from the logical context that port X of thelogical switch220 is the port to send out the packet. Furthermore, the managed switchingelement2505 determines access control for the packet with respect to port X. For instance, the managed switchingelement2505 determines that the packet does not have network addresses that will cause thelogical switch220 not to send the packet through the port X.
The managedswitching element2505 then performs theL3 processing210 on the packet because the packet's destination IP address, 1.1.2.10, is for another logical network (i.e., when the packet's destination logical network is different than the logical network whose traffic is processed by the logical switch220). The managedswitching element2505 determines access control for the packet at L3. For instance, the managed switchingelement2505 determines that the packet does not have network addresses that will cause thelogical router225 to reject the packet that came throughlogical port 1 of thelogical router225. The managedswitching element2505 also looks up the L3 flow entries and determines that the packet is to be sent to thelogical port 2 of thelogical router225 because the destination IP address of the packet, 1.1.2.10, belongs to the subnet address of 1.1.2.1/24 that is associated with thelogical port 2 of thelogical router225. Furthermore, the managed switchingelement2505 determines access control for the packet with respect to thelogical port 2 of thelogical router225. For instance, the managed switchingelement2505 determines that the packet does not have network addresses that will cause thelogical switch220 not to send the packet through thelogical port 2.
The managedswitching element2505 modifies the logical context of the packet or the packet itself while performing theL3 processing210. For instance, the managed switchingelement2505 modifies the logical source MAC address of the packet to be the MAC address of thelogical port 2 of the logical router225 (i.e., 01:01:01:01:01:02 in this example). The managedswitching element2505 also modifies the destination MAC address of the packet to be a MAC address ofVM 3.
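The L3 routing and MAC rewriting just described amount to a longest-prefix match on the destination IP address, followed by rewriting the source MAC to the egress router port's MAC and the destination MAC to the next hop's MAC. The following Python sketch illustrates this under stated assumptions; the route table, the 1.1.2.0/24 spelling of the subnet, and the ARP cache contents are illustrative values, not the data structures of any embodiment.

```python
# Illustrative sketch of the L3 processing 210: longest-prefix match on the
# destination IP selects a logical router port, then the MACs are rewritten.
import ipaddress

ROUTES = [
    # (prefix, egress logical router port, MAC of that port)
    (ipaddress.ip_network("1.1.1.0/24"), 1, "01:01:01:01:01:01"),
    (ipaddress.ip_network("1.1.2.0/24"), 2, "01:01:01:01:01:02"),
]
ARP_CACHE = {"1.1.2.10": "mac:vm3"}     # hypothetical resolved next-hop MACs

def l3_route(dst_ip: str):
    dst = ipaddress.ip_address(dst_ip)
    # Longest-prefix match over the logical router 225's routes.
    best = max((r for r in ROUTES if dst in r[0]), key=lambda r: r[0].prefixlen)
    _, out_lrp, port_mac = best
    # Source MAC becomes the egress router port's MAC; destination MAC becomes
    # the next hop's MAC (here, VM 3 itself).
    return {"out_lrp": out_lrp, "new_src_mac": port_mac, "new_dst_mac": ARP_CACHE[dst_ip]}

print(l3_route("1.1.2.10"))
```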
The managedswitching element2505 then performs theL2 processing215. Specifically, the managed switchingelement2505 determines access control for the packet. For instance, the managed switchingelement2505 determines that the packet does not have network addresses (e.g., source/destination MAC/IP addresses, etc.) that will cause thelogical switch230 to reject the packet that came through port Y of thelogical switch230. The managedswitching element2505 then determines thatport 1 of thelogical switch230 is the port through which to send the packet out to the destination,VM 3. Furthermore, the managed switchingelement2505 determines access control for the packet with respect toport 1 of thelogical switch230. For instance, the managed switchingelement2505 determines that the packet does not have network addresses that will cause thelogical switch230 not to send the packet through theport 1 of thelogical switch230.
The managedswitching element2505 then performs a mapping lookup to determine a physical port to which thelogical port 1 of thelogical switch230 is mapped. In this example, the managed switchingelement2505 determines thatlogical port 1 of thelogical switch230 is mapped toport 5 of the managed switchingelement2505. The managedswitching element2505 then performs a physical lookup to determine operations for forwarding the packet to the physical port. In this example, the managed switchingelement2505 determines that the packet should be sent toVM 3 throughport 5 of the managed switchingelement2505. The managedswitching element2505 in this example removes the logical context from the packet before sending out the packet toVM 3. Thus, the packet did not have to go to another managed switching element although the packet did go through two logical switches and a logical router.
WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 4 that is coupled to thelogical switch230, the packet is sent toVM 4 in a similar manner in which the packet sent fromVM 1 toVM 3 is sent toVM 3, except that the packet heading toVM 4 is sent from the managed switchingelement2505 to the managed switchingelement2510 over the tunnel established between the managed switchingelements2505 and2510 and gets toVM 4 throughport 5 of the managed switchingelement2510.
FIGS. 30A-30B conceptually illustrate an example operation of the logical switches 220 and 230, the logical router 225, and the managed switching elements 2505 and 2510 described above by reference to FIG. 29. Specifically, FIG. 30A illustrates an operation of the managed switching element 2505, which implements the logical switches 220 and 230 and the logical router 225. FIG. 30B illustrates an operation of the managed switching element 2510.
As shown in the bottom half of FIG. 30A, the managed switching element 2505 includes L2 entries 3005 and 3015 and L3 entries 3010. These entries are flow entries that a controller cluster (not shown) supplies to the managed switching element 2505. Although these entries are depicted as three separate tables, the tables do not necessarily have to be separate tables. That is, a single table may include all these flow entries.
WhenVM 1 that is coupled to thelogical switch220 sends apacket3030 toVM 4 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement2505 throughport 4 of the managed switchingelement2505. The managedswitching element2505 performs an L2 processing on the packet based on the forwarding tables3005-3015 of the managed switchingelement2505. In this example, thepacket3030 has a destination IP address of 1.1.2.10, which is the IP address ofVM 4. Thepacket3030's source IP address is 1.1.1.10. Thepacket3030 also hasVM 1's MAC address as a source MAC address and the MAC address of the logical port 1 (e.g., 01:01:01:01:01:01) of thelogical router225 as a destination MAC address.
The managedswitching element2505 identifies a record indicated by an encircled 1 (referred to as “record 1”) in the forwarding tables that implements the context mapping of thestage2605. Therecord 1 identifies thepacket3030's logical context based on the inport, which is theport 4 through which thepacket3030 is received fromVM 1. In addition, therecord 1 specifies that the managed switchingelement2505 store the logical context of thepacket3030 in a set of fields (e.g., a VLAN id field) of thepacket3030's header. Therecord 1 also specifies thepacket3030 be further processed by the forwarding tables (e.g., by sending thepacket3030 to a dispatch port). A dispatch port is described in U.S. patent application Ser. No. 13/177,535.
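One way to picture the dispatch-port mechanism referenced above is as a resubmission loop: a record performs its action, marks how far the pipeline has progressed, and sends the packet back to the forwarding tables for the next stage. This also illustrates how the separate tables 3005-3015 could be collapsed into a single table, as noted above. The sketch below is an interpretation only; the dispatch-port semantics are detailed in the incorporated application, and the stage names and structures here are hypothetical.

```python
# Interpretive sketch of dispatch-port style processing in a single table.
TABLE = [
    ("context_map", lambda p: p["in_port"] == 4,
     lambda p: p.update(stage="ingress_acl", lswitch="LS220", in_lport=1)),
    ("ingress_acl", lambda p: True,
     lambda p: p.update(stage="l2_forward")),
    ("l2_forward", lambda p: p["dst_mac"] == "01:01:01:01:01:01",
     lambda p: p.update(stage="done", out_lport="X")),
]

def run_pipeline(packet):
    packet["stage"] = "context_map"
    while packet["stage"] != "done":            # each pass models a resubmit via the dispatch port
        for stage, match, action in TABLE:
            if stage == packet["stage"] and match(packet):
                action(packet)
                break
        else:
            return "drop"                       # no record matched at this stage
    return packet

print(run_pipeline({"in_port": 4, "dst_mac": "01:01:01:01:01:01"}))
```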
Based on the logical context and/or other fields stored in thepacket3030's header, the managed switchingelement2505 identifies a record indicated by an encircled 2 (referred to as “record 2”) in the forwarding tables that implements the ingress ACL of thestage2610. In this example, therecord 2 allows thepacket3030 to be further processed (i.e., thepacket3030 can get through the ingress port of the logical switch220) and, thus, specifies thepacket3030 be further processed by the forwarding tables (e.g., by sending thepacket3030 to a dispatch port). In addition, therecord 2 specifies that the managed switchingelement2505 store the logical context (i.e., thepacket3030 has been processed by the second stage3042 of the processing pipeline3000) of thepacket3030 in the set of fields of thepacket3030's header.
Next, the managed switchingelement2505 identifies, based on the logical context and/or other fields stored in thepacket3030's header, a record indicated by an encircled 3 (referred to as “record 3”) in the forwarding tables that implements the logical L2 forwarding of thestage2615. Therecord 3 specifies that a packet with the MAC address of thelogical port 1 of thelogical router225 as a destination MAC address is to be sent to the logical port X of thelogical switch220.
Therecord 3 also specifies that thepacket3030 be further processed by the forwarding tables (e.g., by sending thepacket3030 to a dispatch port). Also, therecord 3 specifies that the managed switchingelement2505 store the logical context (i.e., thepacket3030 has been processed by thethird stage2615 of the processing pipeline3000) in the set of fields of thepacket3030's header.
Next, the managed switchingelement2505 identifies, based on the logical context and/or other fields stored in thepacket3030's header, a record indicated by an encircled 4 (referred to as “record 4”) in the forwarding tables that implements the egress ACL of thestage2620. In this example, therecord 4 allows thepacket3030 to be further processed (e.g., thepacket3030 can get out of thelogical switch220 through port “X” of the logical switch220) and, thus, specifies thepacket3030 be further processed by the flow entries of the managed switching element2505 (e.g., by sending thepacket3030 to a dispatch port). In addition, therecord 4 specifies that the managed switchingelement2505 store the logical context (i.e., thepacket3030 has been processed by thestage2620 of the processing pipeline3000) of thepacket3030 in the set of fields of thepacket3030's header. (It is to be noted that all records specify that a managed switching element update the logical context store in the set of fields whenever the managed switching element performs some portion of logical processing based on a record.)
The managedswitching element2505 continues processing thepacket3030 based on the flow entries. The managedswitching element2505 identifies, based on the logical context and/or other fields stored in thepacket3030's header, a record indicated by an encircled 5 (referred to as “record 5”) in theL3 entries3010 that implements L3 ingress ACL by specifying that the managed switchingelement2505 should accept the packet through thelogical port 1 of thelogical router225 based on the information in the header of thepacket3030.
The managedswitching element2505 then identifies a flow entry indicated by an encircled 6 (referred to as “record 6”) in theL3 entries3010 that implementsL3 routing2640 by specifying that thepacket3030 with its destination IP address (e.g., 1.1.2.10) should exit out ofport 2 of thelogical router225. Also, the record 6 (or another record in the routing table, not shown) indicates that the source MAC address for thepacket3030 is to be rewritten to the MAC address ofport 2 of the logical router225 (i.e., 01:01:01:01:01:02).
The managedswitching element2505 then identifies a flow entry indicated by an encircled 7 (referred to as “record 7”) in theL3 entries3010 that implements L3 egress ACL by specifying that the managed switchingelement2505 allow the packet to exit out throughport 2 of thelogical router225 based on the information (e.g., source IP address) in the header of thepacket3030.
Based on the logical context and/or other fields stored in thepacket3030's header, the managed switchingelement2505 identifies a record indicated by an encircled 8 (referred to as “record 8”) in theL2 entries3015 that implements the ingress ACL of thestage2660. In this example, therecord 8 specifies thepacket3030 be further processed by the managed switching element2505 (e.g., by sending thepacket3030 to a dispatch port). In addition, therecord 8 specifies that the managed switchingelement2505 store the logical context (i.e., thepacket3030 has been processed by thestage2660 of the processing pipeline3000) of thepacket3030 in the set of fields of thepacket3030's header.
Next, the managed switchingelement2505 identifies, based on the logical context and/or other fields stored in thepacket3030's header, a record indicated by an encircled 9 (referred to as “record 9”) in theL2 entries3015 that implements the logical L2 forwarding of thestage2665. Therecord 9 specifies that a packet with the MAC address ofVM 4 as the destination MAC address should be forwarded through a logical port (not shown) of thelogical switch230 that is connected toVM 4.
Therecord 9 also specifies that thepacket3030 be further processed by the forwarding tables (e.g., by sending thepacket3030 to a dispatch port). Also, therecord 9 specifies that the managed switchingelement2505 store the logical context (i.e., thepacket3030 has been processed by thestage2665 of the processing pipeline3000) in the set of fields of thepacket3030's header.
Next, the managed switchingelement2505 identifies, based on the logical context and/or other fields stored in thepacket3030's header, a record indicated by an encircled 10 (referred to as “record 10”) in the forwarding tables that implements the egress ACL of thestage2670. In this example, therecord 10 allows thepacket3030 to exit through a logical port (not shown) that connects toVM 4 and, thus, specifies thepacket3030 be further processed by the forwarding tables (e.g., by sending thepacket3030 to a dispatch port). In addition, therecord 10 specifies that the managed switchingelement2505 store the logical context (i.e., thepacket3030 has been processed by thestage2670 of the processing pipeline3000) of thepacket3030 in the set of fields of thepacket3030's header.
Based on the logical context and/or other fields stored in thepacket3030's header, the managed switchingelement2505 identifies a record indicated by an encircled 11 (referred to as “record 11”) in theL2 entries3015 that implements the context mapping of thestage2675. In this example, the record 11 identifiesport 5 of the managed switchingelement2510 to whichVM 4 is coupled as the port that corresponds to the logical port (determined at stage2665) of thelogical switch230 to which thepacket3030 is to be forwarded. The record 11 additionally specifies that thepacket3030 be further processed by the forwarding tables (e.g., by sending thepacket3030 to a dispatch port).
Based on the logical context and/or other fields stored in thepacket3030's header, the managed switchingelement2505 then identifies a record indicated by an encircled 12 (referred to as “record 12”) in theL2 entries3015 that implements the physical mapping of thestage2680. Therecord 12 specifiesport 3 of the managed switchingelement2505 as a port through which thepacket3030 is to be sent in order for thepacket3030 to reach the managed switchingelement2510. In this case, the managed switchingelement2505 is to send thepacket3030 out ofport 3 of managed switchingelement2505 that is coupled to the managed switchingelement2510.
As shown in FIG. 30B, the managed switching element 2510 includes a forwarding table that includes rules (e.g., flow entries) for processing and routing the packet 3030. When the managed switching element 2510 receives the packet 3030 from the managed switching element 2505, the managed switching element 2510 begins processing the packet 3030 based on the forwarding tables of the managed switching element 2510. The managed switching element 2510 identifies a record indicated by an encircled 1 (referred to as “record 1”) in the forwarding tables that implements the context mapping. The record 1 identifies the packet 3030's logical context based on the logical context that is stored in the packet 3030's header. The logical context specifies that the packet 3030 has been processed by the entire logical processing pipeline 200, which was performed by the managed switching element 2505. As such, the record 1 specifies that the packet 3030 be further processed by the forwarding tables (e.g., by sending the packet 3030 to a dispatch port).
Next, the managed switchingelement2510 identifies, based on the logical context and/or other fields stored in thepacket3030's header, a record indicated by an encircled 2 (referred to as “record 2”) in the forwarding tables that implements the physical mapping. Therecord 2 specifies theport 5 of the managed switchingelement2510 through which thepacket3030 is to be sent in order for thepacket3030 to reachVM 4. In this case, the managed switchingelement2510 is to send thepacket3030 out ofport 5 of managed switchingelement2510 that is coupled toVM 4. In some embodiments, the managed switchingelement2510 removes the logical context from thepacket3030 before sending the packet toVM 4.
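The non-first-hop behavior described above reduces to reading the logical context written by the first hop, mapping the logical egress port to a local physical port, and stripping the context before delivery. The following Python sketch is a minimal illustration with hypothetical field and table names.

```python
# Illustrative sketch of the last-hop behavior of the managed switching element 2510.
LOGICAL_EGRESS_TO_PHYS = {("LS230", "port_to_vm4"): 5}     # logical port -> local port 5

def last_hop(packet):
    ctx = packet["context"]
    if not ctx.get("pipeline_done"):
        raise ValueError("the first hop should have completed the logical pipeline")
    phys_port = LOGICAL_EGRESS_TO_PHYS[(ctx["switch"], ctx["out_lport"])]
    packet.pop("context")            # remove the logical context before delivery to the VM
    return ("output", phys_port)

pkt = {"context": {"pipeline_done": True, "switch": "LS230", "out_lport": "port_to_vm4"}}
print(last_hop(pkt))                 # ('output', 5)
```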
FIG. 31 conceptually illustrates an example software architecture of a host on which a managed switching element runs. Specifically, this figure illustrates that the managed switching element that runs a logical processing pipeline to logically forward and route packets uses a NAT daemon for translating network addresses. This figure illustrates a host 3100, a managed switching element 3105, a forwarding table 3120, a NAT daemon 3110, and a NAT table 3115 in the top half of the figure. This figure illustrates flow entries 3125 and 3130.
The flow entries 3125 and 3130 are flow entries that each have a qualifier and an action. The text illustrated as the flow entries 3125 and 3130 may not be in an actual format. Rather, the text is just a conceptual illustration of a qualifier and an action pair. In some embodiments, flow entries have priorities, and a managed switching element takes the action of the flow entry with the highest priority when the qualifiers of more than one flow entry are satisfied.
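A minimal sketch of the priority rule stated above, with hypothetical flow entries: when the qualifiers of several entries match, the highest-priority entry supplies the action.

```python
# Illustrative sketch: both hypothetical entries below match a packet from
# 1.1.1.10, and the entry with the higher priority supplies the action.
flows = [
    {"priority": 5, "match": lambda p: p["src_ip"] == "1.1.1.10", "action": "send to NAT daemon"},
    {"priority": 6, "match": lambda p: p["src_ip"] == "1.1.1.10", "action": "rewrite src to 2.1.1.10"},
]

def choose(packet):
    hits = [f for f in flows if f["match"](packet)]
    return max(hits, key=lambda f: f["priority"])["action"] if hits else "drop"

print(choose({"src_ip": "1.1.1.10"}))    # the higher-priority entry wins
```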
Thehost3100, in some embodiments, is a machine operated by an operating system (e.g., Windows™ and Linux™) that is capable of running a set of software applications. The managedswitching element3105 of some embodiments is a software switching element (e.g., Open vSwitch) that executes in thehost3100. As mentioned above, a controller cluster (not shown) configures a managed switching element by supplying flow entries that specify the functionality of the managed switching element. The managedswitching element3105 of some embodiments does not itself generate flow entries.
The managedswitching element3105 of some embodiments runs all or part of thelogical processing pipeline200 described above. In particular, the managed switchingelement3105 is a managed switching element (e.g., the managed switchingelements1720 or2505) that performs theL3 processing210 to route packets received from the machines if necessary, based on flow entries in the forwarding table3120. In some embodiments, the managed switchingelement3105 is an edge switching element that receives a packet from a machine (not shown) that is coupled to the managed switching element. In some such embodiments, one or more virtual machines (not shown) are running in thehost3100 and are coupled to the managed switchingelements3105. In other embodiments, the managed switching element is a second-level managed switching element.
When the managed switchingelement3105 is configured to perform network address translation (NAT), the managed switchingelement3105 of some embodiments uses theNAT daemon3110 for performing NAT on packets. In some embodiments, the managed switchingelement3105 does not maintain a lookup table for finding an address to which to translate from a given address. Instead, the managed switchingelement3105 asks theNAT daemon3110 for addresses.
TheNAT daemon3110 of some embodiments is a software application running on thehost3100. TheNAT daemon3110 maintains the table3115 which includes pairings of addresses where each pair includes two addresses to be translated into each other. When the managed switchingelement3105 asks for an address to which to translate from a given address, the NAT daemon looks up the table3115 to find the address into which the given address should be translated.
The managedswitching element3105 and theNAT daemon3110 of different embodiments use different techniques to ask for and supply addresses. For instance, the managed switchingelement3105 of some embodiments sends a packet, which has an original address but does not have a translated address, to the NAT daemon. TheNAT daemon3110 of these embodiments translates the original address into a translated address. TheNAT daemon3110 sends the packet back to the managed switchingelement3105, which will perform logical forwarding and/or routing to send the packet towards the destination machine. In some embodiments, the managed switchingelement3105 initially sends metadata, along with the packet that contains an original address to resolve, to theNAT daemon3110. This metadata includes information (e.g., register values, logical pipeline state, etc.) that the managed switchingelement3105 uses to resume performing the logical processing pipeline when the managed switchingelement3105 receives the packet back from theNAT daemon3110.
In other embodiments, the managed switching element 3105 requests addresses by sending a flow template, which is a flow entry that does not have actual values for the addresses, to the NAT daemon 3110. The NAT daemon finds the addresses with which to fill in the flow template by looking up the table 3115. The NAT daemon 3110 then sends the flow template that is filled in with actual addresses back to the managed switching element 3105 by putting the filled-in flow template into the forwarding table 3120. In some embodiments, the NAT daemon assigns a priority value to the filled-in flow template that is higher than the priority value of the flow template that is not filled in. Moreover, when the NAT daemon 3110 fails to find a translated address, the NAT daemon specifies in the flow template to drop the packet.
An example operation of the managed switchingelement3105 and theNAT daemon3110 will now be described in terms of three different stages 1-3 (encircled 1-3). In this example, the managed switchingelement3105 is a managed edge switching element that receives a packet to forward and route from a machine (not shown). The managedswitching element3105 receives a packet and performs theL3 processing210 based on the flow entries in the forwarding table3120.
While performing theL3 processing210 on the packet, the managed switching element3105 (at stage 1) identifies theflow entry3125 and performs the action specified in theflow entry3125. As shown, theflow entry3125 indicates that a flow template having an IP address 1.1.1.10 to be translated to X should be sent to theNAT daemon3110. In this example, theflow entry3125 has a priority value of N, which is a number in some embodiments.
Atstage 2, theNAT daemon3110 receives the flow template and finds out that 1.1.1.10 is to be translated into 2.1.1.10 by looking up the NAT table3115. The NAT daemon fills out the flow template and inserts the filled-in template (now the flow entry3130) into the forwarding table3120. In this example, the NAT daemon assigns a priority of N+1 to the filled-in template.
At stage 3, the managed switching element 3105 uses the flow entry 3130 to change the address for the packet. Also, for the packets that the managed switching element 3105 subsequently processes, the managed switching element 3105 uses the flow entry 3130 over the flow entry 3125 when a packet has the source IP address of 1.1.1.10.
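The three stages above can be sketched as follows. This is an illustrative Python model, not the actual flow entry or flow template format: the NAT daemon fills the placeholder address from its table and installs the result at priority N+1 so that it takes precedence over the unfilled template.

```python
# Illustrative model of stages 1-3 with hypothetical dictionaries.
NAT_TABLE = {"1.1.1.10": "2.1.1.10"}

forwarding_table = [
    {"priority": 10, "src_ip": "1.1.1.10", "set_src_ip": None},    # the template (flow entry 3125)
]

def nat_daemon_fill(template):
    translated = NAT_TABLE.get(template["src_ip"])
    if translated is None:
        # No translation found: the filled-in entry drops the packet instead.
        return {**template, "priority": template["priority"] + 1, "action": "drop"}
    return {**template, "priority": template["priority"] + 1, "set_src_ip": translated}

forwarding_table.append(nat_daemon_fill(forwarding_table[0]))       # the filled-in entry (3130)
print(forwarding_table[-1])   # {'priority': 11, 'src_ip': '1.1.1.10', 'set_src_ip': '2.1.1.10'}
```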
In some embodiments, theNAT daemon3110 and the managed switching element run in a same virtual machine that is running on thehost3100 or in different virtual machines running on thehost3100. TheNAT daemon3110 and the managed switching element may also run in separate hosts.
FIG. 32 conceptually illustrates a process 3200 that some embodiments perform to translate network addresses. In some embodiments, the process 3200 is performed by a managed switching element that performs an L3 processing 210 to route packets at L3 (e.g., the managed switching elements 1720, 2505, or 3105). The process 3200, in some embodiments, starts when the process receives a packet that is to be logically routed at L3.
The process 3200 begins by determining (at 3205) whether the packet needs network address translation (NAT). In some embodiments, the process determines whether the packet needs NAT based on a flow entry whose qualifier matches the information stored in the packet's header or logical context and that specifies that the packet needs NAT. As mentioned above, the NAT could be SNAT or DNAT. The flow entry also specifies which NAT is to be performed on the packet.
When the process 3200 determines (at 3205) that the packet does not need NAT, the process ends. Otherwise, the process 3200 determines (at 3210) whether the process 3200 needs to request an address into which to translate the packet's address (e.g., the source IP address) from a NAT daemon. In some embodiments, the process 3200 determines whether it needs to ask the NAT daemon based on the flow entry. For instance, the flow entry may specify that the address into which to translate the packet's address should be obtained by requesting the address from the NAT daemon. In some embodiments, the process determines that the NAT daemon should provide the translated address when the flow entry is a flow template that has an empty field for the translated address or some other value in the field indicating that the translated address should be obtained from the NAT daemon.
When the process determines (at 3210) that it does not need to request an address from the NAT daemon, the process obtains (at 3220) the translated address from the flow entry. For instance, the flow entry provides the translated address. The process then proceeds to 3225, which is described further below. When the process determines (at 3210) that it needs to request an address from the NAT daemon, the process 3200 at 3215 requests and obtains the translated address from the NAT daemon. In some embodiments, the process 3200 requests the translated address by sending a flow template to the NAT daemon. The NAT daemon fills in the flow template with the translated address and places that filled-in flow template in the forwarding table (not shown) that the process uses.
Next, theprocess3200 modifies (at3225) the packet with the translated address. In some embodiments, the process modifies an address field in the header of the packet. Alternatively or conjunctively, the process modifies the logical context to replace the packet's address with the translated address. The process then ends.
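A compact sketch of the process 3200 follows, assuming hypothetical names for the flow entry fields and the NAT daemon interface.

```python
# Illustrative sketch of process 3200; helpers are hypothetical placeholders.
class FakeNatDaemon:
    def __init__(self):
        self.table = {"1.1.1.10": "2.1.1.10"}
    def lookup(self, addr):
        return self.table.get(addr, addr)

def process_3200(packet, flow_entry, nat_daemon):
    # 3205: does the matching flow entry call for NAT at all?
    if not flow_entry.get("needs_nat"):
        return packet
    # 3210/3215/3220: use the address carried by the flow entry, or ask the
    # NAT daemon when the entry is a template with an empty translated field.
    translated = flow_entry.get("translated_ip") or nat_daemon.lookup(packet["src_ip"])
    # 3225: rewrite the packet (and, conjunctively, its logical context).
    packet["src_ip"] = translated
    return packet

print(process_3200({"src_ip": "1.1.1.10"},
                   {"needs_nat": True, "translated_ip": None},
                   FakeNatDaemon()))
```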
It is to be noted that the MAC addresses, IP addresses, and other network addresses used above and below in this application are examples for illustrative purposes and may not have values in the allowable ranges unless specified otherwise.
II. Next-Hop Virtualization
Logical networks interfacing external networks need to interact with a next-hop router. The virtualization applications of different embodiments use different models to interface a logical L3 network with external networks through a next-hop router.
First, in a fixed attachment model, the physical infrastructure interacts with a set of managed integration elements that will receive all the ingress traffic for a given IP prefix and will send all the egress traffic back to the physical network. In this model, logical abstraction can be a single logical uplink port for the logical L3 router per a given set of managed integration elements. In some embodiments, there could be more than a single integration cluster. The logical control plane that is provided by the control application is responsible for routing outbound, egress traffic towards the uplink(s). In some embodiments, examples of managed integration elements include second-level managed switching elements that function as extenders, which are described in U.S. patent application Ser. No. 13/177,535. The examples of managed integration elements also include the managed switching element described above by reference toFIGS. 8, 9, and 10.
Second, in a distributed attachment model, the virtualization application distributes the attachment throughout managed edge switching elements that it connects. To do so, the managed edge switching elements have to integrate to the physical routing infrastructure. In other words, each managed edge switching element has to be able to communicate with the physical routing infrastructure outside of the group of managed switching elements. In some embodiments, these switching elements use the IGP protocol (or other routing protocol) to communicate with the physical switching elements (e.g., the physical routers) that send packets into the logical network (implemented by the managed switching elements) and receive packets from the logical network. Using this protocol the managed edge switching elements of some embodiments can advertise host routes (/32) to attract direct ingress traffic to its proper location. While, in some embodiments, there is no centralized traffic hotspot as the ingress and egress traffic is completely distributed over the managed switching elements, the logical abstraction is still a single logical uplink port for the logical L3 router and the logical control plane is responsible for routing traffic to the uplink. Nothing prevents having more than a single uplink port exposed for the logical control plane if that is beneficial for the control plane. However, the number of uplink ports does not have to match with the number of attachment points in this model.
Third, in a control plane driven model, the logical control plane is responsible for integrating with the external network. The control plane is exposed with one-to-one routing integration: for every attachment point in the physical network, there is a logical port. The logical control plane has the responsibility to peer with next-hop routers at the routing protocol level.
The three models all hit different design trade-offs: the fixed attachment model implies non-optimal physical traffic routes but requires less integration with the physical infrastructure. Of the distributed models, the fully distributed model scales best, in some embodiments, as the logical control plane is not responsible for all the peering traffic, which in the extreme could be thousands of peering sessions. The control plane driven model, however, gives the maximal control to the logical control plane. The maximal control requires policy routing, though, as the egress port has to depend on the ingress port if optimal physical routes are desired.
III. Stateful Packet Operations
Stateful packet operations place NAT on a logical L3 datapath for the routed traffic. In the logical pipeline, network address translation is done in an extra NAT stage before or after the actual standard L3 pipeline. In other words, network address translation hits the packet before or after the routing. In some embodiments, NAT configuration is done via flow templates that create the actual address translation entries. Flow templates will be further described below.
Placing the NAT functionality is one feature that deviates from the approach of performing all or most of the logical packet processing in first hop. The basic model of executing most or all of the operations at the first-hop places the processing of packets flowing in opposite directions at different first-hop switching elements in some embodiments: for a given transport level flow, the packets in one direction would be sent through the logical pipeline at one end, and the packets in the reverse direction would be sent through the pipeline at the other end. Unfortunately, the per flow NAT state can be fairly rich (especially if NAT supports higher level application protocols) and the state has to be shared between the directions, for a given transport flow.
Hence, some embodiments let the first-hop switching elements of the logical port receive the opening packet of the transport flow to execute the logical pipelines to both directions. For example, if VM A opens a TCP connection to VM B, then the edge switching element connected to the hypervisor (which may run on the same machine as the hypervisor) of VM A becomes responsible for sending the packets through the logical pipelines to both directions. This allows for purely distributed NAT functionality, as well as having multiple NATs in the logical network topology. The first-hop switching element will execute all the necessary NAT translations, regardless how many there are, and the network address translation just becomes an extra step in the LDPS pipelines the packet traverses (within that switching element).
However, feeding the packets sent in the reverse direction through the logical pipelines requires additional measures; otherwise, the first-hop switching element for the reverse packets will execute the processing (without having the NAT state locally available). For this purpose, some embodiments allow the first packet sent from the source edge switching element (of VM A above) to the destination edge switching element (of VM B above) to establish a special “hint state” that makes the destination switching element send the reverse packets of that transport flow directly to the source switching element without processing. The source switching element will then execute the pipelines in the reverse direction and reverse the NAT operations using the local NAT state for the reverse packets. Some embodiments use the flow templates (which are described below) to establish this reverse hint state at the destination switching element, so that the controller does not need to be involved in per-flow operations.
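The reverse hint state can be sketched as a small table keyed on the reverse direction of the flow. The sketch below is illustrative only; an actual entry established via a flow template would typically match the full transport 5-tuple rather than just the address pair.

```python
# Illustrative sketch of the reverse hint state kept at the destination edge
# switching element.
hint_table = {}   # (src_ip, dst_ip) of the reverse direction -> source element

def on_opening_packet(pkt, source_element):
    # Installed when the first packet of a NAT'ed flow arrives from the source
    # edge switching element (VM A's side).
    hint_table[(pkt["dst_ip"], pkt["src_ip"])] = source_element

def on_packet_from_local_vm(pkt):
    key = (pkt["src_ip"], pkt["dst_ip"])
    if key in hint_table:
        # Reverse packet of a hinted flow: skip logical processing and tunnel
        # it back; the source element reverses the NAT with its local state.
        return ("tunnel_to", hint_table[key])
    return ("run_logical_pipeline",)

on_opening_packet({"src_ip": "3.1.1.10", "dst_ip": "1.1.2.10"}, "source_edge_element")
print(on_packet_from_local_vm({"src_ip": "1.1.2.10", "dst_ip": "3.1.1.10"}))
```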
The next two figures, FIGS. 33 and 34, illustrate placing the NAT functionality and the hint state. FIG. 33 conceptually illustrates that a first-hop switching element of some embodiments performs the entire logical processing pipeline 200 including the NAT operation 2645. FIG. 33 is identical to FIG. 29 except that the logical processing pipeline 200 includes the NAT operation 2645, which is depicted in the L3 processing 210 to indicate that the NAT operation 2645 is performed.
A managed switching element of some embodiments that implements a logical router performs a NAT operation on a packet after the packet is routed by the logical router. For instance, whenVM 1 that is coupled to thelogical switch220 sends a packet toVM 4 that is coupled to thelogical switch230, the managed switchingelement2505 translates the source IP address (e.g., 1.1.1.10) of the packet into a different IP address (e.g., 3.1.1.10) before sending the packet out to the managed switchingelement2510. The managedswitching element2505 performs theNAT operation2645 based on a set of NAT rules (e.g., flow entries) configured in the managed switchingelement2505 by the controller cluster (not shown) that manages the managed switchingelement2505.
The packet thatVM 4 receives has the translated IP address, 3.1.1.10, as the packet's source IP address. A return packet fromVM 4 toVM 1 will have this translated address as the packet's destination IP address. Thus, the translated IP address has to be translated back toVM 1's IP address in order for this packet to reachVM 1. However, the managed switchingelement2510 of some embodiments would not perform theNAT operation2645 to recoverVM 1's IP address for the returning packet because the NAT rules for performing NAT operations are only in the managed switchingelement2505 and are not in the managed switchingelement2510. In this manner, the NAT rules and the state do not have to be shared by all potential managed edge switching elements.
FIG. 34 conceptually illustrates an example of such embodiments. Specifically,FIG. 34 illustrates that the managed switchingelement2510 does not perform a logical processing pipeline when sending the returning packet to the managed switchingelement2505. This figure also illustrates that the managed switchingelement2505, upon receiving a returning packet from the managed switchingelement2510, performs thelogical processing pipeline200 as if the managed switchingelement2505 were the first-hop switching element with respect to this returning packet.FIG. 34 is identical withFIG. 33 except the logical processing pipeline is depicted in the opposite direction (with arrows pointing to the left).FIG. 34 also illustrates arule3400 and a forwarding table3405.
Therule3400, in some embodiments, is a flow entry in the forwarding table3405 that is configured by a controller cluster (not shown) that manages the managednetwork switching element2510. Therule3400 specifies (or “hints”) that when the managed switchingelement2510 receives a packet originating from the managed switchingelement2505, the managed switchingelement2510 should not perform a logical processing pipeline on the returning packets to the managed switchingelement2505.
When the managed switchingelement2510 receives from the managed switching element2505 a packet on which the managed switchingelement2505 has performed a NAT operation, the managed switchingelement2510 finds therule3400 based on the information included in the packet's header (e.g., logical context). Also, the managed switchingelement2510, in some embodiments, modifies one or more other flow entries to indicate that no logical processing pipeline should be performed on packets from the destination machine (e.g., VM 4) of the received packet that are headed to the source machine (e.g., VM 1).
The managedswitching element2510 then forwards this packet to the destination machine, e.g.,VM 4. When the managed switchingelement2510 receives a returning packet fromVM 4 that is headed toVM 1, the managed switchingelement2510 will not perform a logical processing pipeline on this packet. That is, the managed switchingelement2510 will not perform logical forwarding at L2 or logical routing at L3. The managedswitching element2510 will simply indicate in the logical context for this packet that no logical processing has been performed on the packet.
When the managed switchingelement2505 receives this packet from the managed switchingelement2510, the managed switchingelement2505 performs thelogical processing pipeline200. Specifically, the managed switchingelement2505 first performs a logical context look up to determine the logical context of the packet based on the information included in the header fields of the packet. In this example, the source MAC address of the packet is a MAC address ofVM 4 and the source IP address of the packet is an IP address ofVM 4. Because the packet is sent fromVM 4 toVM 1 that is in a different logical network, the packet has a MAC address associated with port Y of thelogical switch230 as the destination MAC address (i.e., 01:01:01:01:01:02 in this example). The destination IP address of the packet is the NAT'ed IP address of VM 1 (i.e., 3.1.1.10).
The managedswitching element2505 then determines access control for the packet with respect to thelogical switch230. For instance, the managed switchingelement2505 determines that the packet does not have network addresses (e.g., source/destination MAC/IP addresses, etc.) that will cause thelogical switch230 to reject the packet that came throughport 2 of thelogical switch230. The managedswitching element2505 also identifies from the logical context that port Y of thelogical switch230 is the port to send out the packet. Furthermore, the managed switchingelement2505 determines access control for the packet with respect to port Y. For instance, the managed switchingelement2505 determines that the packet does not have network addresses that will cause thelogical switch230 not to send the packet through the port Y.
Next, the managed switchingelement2505 performs theNAT operation2645 on the packet to translate the destination IP address back to the IP address ofVM 1. That is, the managed switchingelement2505 in this example replaces 3.1.1.10 with 1.1.1.10 based on the NAT rules. The managedswitching element2505 then performs an L3 processing on the packet because the packet's destination IP address, now 1.1.1.10, is for another logical network. The managedswitching element2505 determines ingress access control for the packet at L3 with respect toport 2 of thelogical router225. The managedswitching element2505 also looks up the flow entries and determines that the packet is to be sent to thelogical port 1 of thelogical router225 because the destination IP address of the packet, 1.1.1.10, belongs to the subnet address of 1.1.1.1/24 that is associated with thelogical port 1 of thelogical router225. Furthermore, the managed switchingelement2505 determines egress access control for the packet with respect to thelogical port 1 of thelogical router225. The managedswitching element2505 also modifies the destination MAC address of the packet to be a MAC address ofVM 1.
The managedswitching element2505 then performs theL2 processing215. In this example, the source MAC address of the packet is now a MAC address oflogical port 1 of thelogical router225 and the source IP address of the packet is still the IP address ofVM 4. The destination IP address of the packet is the IP address of VM 1 (i.e., 1.1.1.10). In this example, the logical context specifies thatlogical switch220 is the logical switch that is to forward the packet and that logical port X of thelogical switch220 is the port through which the packet was received. The logical context also specifies thatport 1 of thelogical switch220 is the port through which to send the packet out to the destination,VM 1, becauseport 1 is associated with the MAC address ofVM 1.
The managedswitching element2505 then performs logical forwarding lookups based on the logical context of the packet, including determining ingress and egress access control with respect to port X andport 1 of thelogical switch220, respectively. The managedswitching element2505 performs a mapping lookup to determine a physical port to which thelogical port 1 of thelogical switch220 is mapped. In this example, the managed switchingelement2505 determines thatlogical port 1 of thelogical switch220 is mapped toport 4 of the managed switchingelement2505. The managedswitching element2505 then performs a physical lookup to determine operations for forwarding the packet to the physical port. In this example, the managed switchingelement2505 determines that the packet should be sent toVM 1 throughport 4 of the managed switchingelement2505.
FIG. 35 conceptually illustrates aprocess3500 that some embodiments perform to send a packet to a destination machine whose address is NAT'ed. Theprocess3500, in some embodiments, is performed by a managed edge switching element that receives a packet directly from a source machine.
The process 3500 begins by receiving (at 3505) a packet from a source machine. The process then determines (at 3510) whether the packet is headed to a destination machine whose address is NAT'ed. In some embodiments, the process determines whether the packet is headed to such a destination machine by looking up flow entries that match the information included in the header of the packet (e.g., the destination IP address). One or more flow entries specify that no logical processing (e.g., logical forwarding at L2 or logical routing at L3) should be performed on the packet when the packet is addressed to a destination machine whose address is NAT'ed. Other flow entries specify that logical processing should be performed when the packet is addressed to a destination machine whose address is not NAT'ed.
When the process 3500 determines (at 3510) that the packet is headed to a destination machine whose address is NAT'ed, the process 3500 proceeds to 3520, which is described further below. When the process 3500 determines (at 3510) that the packet is headed to a destination machine whose address is not NAT'ed, the process 3500 performs (at 3515) logical processing on the packet (e.g., logical forwarding at L2 and/or logical routing at L3).
The process 3500 then sends (at 3520) the packet to the next-hop managed switching element en route to the destination machine. The process 3500 then ends.
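A minimal sketch of the process 3500 follows, with the NAT'ed-destination test, the logical pipeline, and the next-hop send represented by hypothetical placeholder functions.

```python
# Illustrative sketch of process 3500; the predicate and pipeline functions
# are hypothetical placeholders wired in by the caller.
def process_3500(packet, dst_is_nated, logical_pipeline, send_next_hop):
    if not dst_is_nated(packet["dst_ip"]):       # 3510
        packet = logical_pipeline(packet)        # 3515: L2 forwarding and/or L3 routing
    return send_next_hop(packet)                 # 3520

result = process_3500(
    {"dst_ip": "3.1.1.10"},
    dst_is_nated=lambda ip: ip.startswith("3."),   # toy test standing in for the flow entries
    logical_pipeline=lambda p: p,
    send_next_hop=lambda p: ("sent", p["dst_ip"]),
)
print(result)
```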
Note above, the controllers are not involved in the per packet operations. The logical control plane only provisions the FIB rules identifying what should be network address translated. All per flow state is established by the datapath (Open vSwitch).
The embodiments described above utilize Source NAT'ing. However, some embodiments use Destination NAT'ing (DNAT'ing) along the same lines. In the case of DNAT'ing, all the processing can be done at the source managed edge switching element.
Moreover, in the case of placing the NAT functionality between the external network and the logical network, the operations are no different from those described above. In this case, for the flows incoming from the external network, the NAT state will be held at the extender (which in this case would be the first-hop managed edge switching element) for both directions. On the other hand, for transport flows initiated towards the external network, the state will be held at the managed edge switching element attached to the originating host/VM.
With this purely distributed approach for the network address translation, VM mobility support requires migrating the established NAT state with the VM to the new hypervisor. Without migrating the NAT state, the transport connections will break. For such conditions, some embodiments are designed to expect the NAT to respond with TCP reset to packets sent to closed/non-existing TCP flows. More advanced implementations will integrate with the VM management system facilitating the migration of the NAT state together with the VM; in this case, the transport connections do not have to break.
FIG. 36 illustrates an example of migrating NAT state from a first host to a second host as a VM migrates from the first host to the second host. Specifically, this figure illustrates using a hypervisor of the first host to migrate the VM and the NAT state associated with the VM. The figure illustrates two hosts 3600 and 3630.
As shown, the host 3600 in this example is a source host from which a VM 3625 is migrating to the host 3630. In the host 3600, a NAT daemon 3610 and a managed switching element 3605 are running. The NAT daemon 3610 is similar to the NAT daemon 3110 described above by reference to FIG. 31. The NAT daemon 3610 maintains the NAT table 3115, which includes mappings of original and translated addresses. The managed switching element 3605 uses the NAT daemon 3610 to obtain translated addresses. The managed switching element, in some embodiments, sends flow templates to the NAT daemon 3610 to send original addresses and to obtain translated addresses as described above.
Thehypervisor3680 creates and manages VMs running in thehost3600. In some embodiments, thehypervisor3680 notifies the managed switchingelement3605 and/or theNAT daemon3610 of a migration of a VM running in thehost3600 out of thehost3600 before the VM migrates to another host. The managedswitching element3605 and/or theNAT daemon3610 gets such notifications by registering for callbacks in the event of a VM migration in some embodiments.
In some such embodiments, the managed switchingelement3605 asks the NAT daemon to fetch the NAT state (e.g., address mapping for the VM and protocol information, etc.) associated with the migrating VM and to provide the NAT state to thehypervisor3680. In some embodiments, theNAT daemon3610 provides the NAT state associated with the migrating VM to thehypervisor3680 when theNAT daemon3610 is directly notified of the migration by thehypervisor3680. Thehypervisor3680 then migrates the NAT state to the destination host along with the migrating VM.
In some embodiments, theNAT daemon3610 sends the NAT state associated with the migrating VM directly to the NAT daemon running in the destination host. In these embodiments, theNAT daemon3610 and/or the managed switchingelement3605 notifies thehypervisor3680 of the completion of the migration of the NAT state so that thehypervisor3680 can start migrating the VM to the destination host.
In some embodiments, the managed switchingelement3605 also provides the flow entries related to the migrating VM to thehypervisor3680 or to the managed switching element running in the destination host. When thehypervisor3680 is provided with the flow entries, thehypervisor3680 sends the flow entries to the flow table of the managed switching element running in the destination host. The migration of flow entries to the destination host is optional since the NAT state alone will enable the managed switching element running in the destination host to obtain translated addresses for the migrating VM.
An example operation of thesource host3600 will now be described. When thehypervisor3680 is to migrate VM3625 (e.g., per user input or inputs from a control cluster), thehypervisor3680 notifies the managed switchingelement3605. The managedswitching element3605 in this example then asks theNAT daemon3610 to fetch the NAT state associated withVM3625 and send the fetched state to thehypervisor3680.
The hypervisor 3680 then migrates the VM 3625 to the destination host 3630 by moving the data of the VM. In some embodiments, the hypervisor 3680 is capable of live migration by capturing the running state of the VM 3625 and sending the state to the destination host 3630. The hypervisor 3680 also moves the fetched NAT state to the NAT table 3645 of the host 3630 so that the managed switching element 3635 running in the host 3630 can obtain translated addresses from the NAT daemon 3640 for the VM 3625 just migrated into the host 3630.
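The interaction just described can be sketched as a pre-migration callback: the switching element or NAT daemon registers for the hypervisor's migration notification, the NAT daemon serializes the per-VM state, and the hypervisor carries that state to the destination host together with the VM. All class and method names below are hypothetical.

```python
# Illustrative sketch of NAT state migration driven by a hypervisor notification.
class NatDaemon:
    def __init__(self):
        self.state = {"vm3625": {"1.1.1.10": "3.1.1.10"}}   # per-VM NAT mappings
    def fetch_state(self, vm):
        return self.state.pop(vm, {})

class DestinationHost:
    def receive(self, vm, payload):
        print(f"installing NAT state for {vm}: {payload}")

class Hypervisor:
    def __init__(self):
        self.callbacks = []
    def register_premigrate(self, cb):
        self.callbacks.append(cb)          # switching element / NAT daemon registers here
    def migrate(self, vm, dest_host):
        payload = {}
        for cb in self.callbacks:          # notify before moving the VM
            payload.update(cb(vm))
        dest_host.receive(vm, payload)     # VM data and NAT state travel together

nat = NatDaemon()
hv = Hypervisor()
hv.register_premigrate(lambda vm: {"nat_state": nat.fetch_state(vm)})
hv.migrate("vm3625", DestinationHost())
```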
FIG. 37 illustrates another example of migrating NAT state from a first host to a second host as a VM migrates from the first host to the second host. Specifically, this figure illustrates using a control cluster to ask a hypervisor of the first host to fetch the NAT state associated with the migrating VM and to send the NAT state to the second host. The figure illustrates twohosts3600 and3630. However, ahypervisor3680 running in thehost3600 in this example does not support notifications to the managed switching element or the NAT daemon running in the source host.
Because the hypervisor 3680 of some embodiments does not notify the managed switching element or the NAT daemon of a migration of a VM to a destination host, the NAT state associated with the migrating VM is sent to the destination host after the hypervisor 3680 starts or completes migrating the VM to the destination host. In particular, the managed switching element 3635, in some embodiments, detects the migration of the VM 3625 by, e.g., detecting the MAC address of the VM 3625, which is new to the managed switching element 3635. The managed switching element 3635 notifies the control cluster 3705 of the addition of the VM 3625 (and therefore of a new port of the managed switching element 3635 for the VM 3625).
The control cluster 3705 is similar to the control clusters 1105 and 2205 described above. Upon receiving the notification from the managed switching element 3635 of the addition of the VM, the control cluster 3705 asks the hypervisor 3680 running in the source host 3600 to fetch the NAT state associated with the migrated VM 3625 and update the NAT table 3645 with the fetched NAT state. In some embodiments, the control cluster 3705 additionally asks the hypervisor 3680 to fetch the flow entries associated with the migrated VM 3625 and put those flow entries in the flow table 3650 of the destination host 3630.
In some embodiments, thecontrol cluster3705 may directly ask the managed switching element and/or theNAT daemon3610 to send the NAT state and/or flow entries to theNAT daemon3640 and/or the managed switchingelement3635 so that the NAT table3645 and/or3650 are updated with the NAT state and/or flow entries associated with the migratedVM3625.
An example operation of the source host 3600, the destination host 3630, and the control cluster 3705 will now be described. When the hypervisor 3680 is to migrate the VM 3625 (e.g., per user input or inputs from a control cluster), the hypervisor 3680 migrates the VM 3625 by moving the configuration data or the running state of the VM 3625 to the host 3630. The VM 3625, now running in the host 3630, sends a packet to the managed switching element 3635. The managed switching element 3635 in this example detects the migration of the VM 3625 to the host 3630 by recognizing that the source MAC address of the packet is new to the managed switching element 3635. The managed switching element 3635 in this example then notifies the control cluster 3705 of the addition of the VM 3625 (or the creation of a new port for the VM 3625).
The control cluster 3705 then asks the hypervisor 3680 to fetch the NAT state associated with VM 3625 and to send the NAT state to the destination host 3630. The managed switching element 3635 running in the destination host 3630 can obtain translated addresses from the NAT daemon 3640 for the VM 3625 that has just migrated into the host 3630.
IV. Load-Balancing
Some embodiments implement load balancing as an extra step in the L3 pipeline. For instance, some embodiments implement a logical bundle based load-balancing step followed by a destination network address translation. In some embodiments, the logical router (that provides the load-balance service) hosts the virtual IP address, and hence will respond to the ARP requests sent to the virtual IP address (VIP). With this, the virtual IP will remain functional even if the traffic is sent to the VIP from the same L2 domain in which the cluster members exist.
FIG. 38 illustrates an example physical implementation of logical switches and a logical router that performs load balancing. In particular, this figure illustrates a centralized L3 routing model in which the logical router is implemented by an L3 router or a managed switching element based on flow entries. This figure illustrates managed switching elements3805-3825 and VMs3830-3850. This figure also illustrates a logical processing pipeline that includesL2 processing3855, DNAT and load balancing3860,L3 routing3865, andL2 processing3870 and3875.
The managedswitching element3805 of some embodiments is a second-level managed switching element functioning as an extender. The managedswitching element3805 in some such embodiments is similar to the managed switchingelements810 and1910 described above in that the managed switchingelement3805 implements a logical router (not shown) based on flow entries (not shown) or is running in the same host on which an L3 router that implements the logical router is running. In addition, the managed switchingelement3805 performs DNAT and load balancing3860 to translate a destination address into another address and balance the load among different machines (e.g., VMs) that provide the same service (e.g., a web service).
The managed switching elements 3805-3825 implement logical switches (not shown) to which VMs 3830-3850 are connected. The VMs 3840 and 3850 in this example provide the same service. That is, the VMs 3840 and 3850, in some embodiments, collectively act as a server that provides the same service. However, the VMs 3840 and 3850 are separate VMs that have different IP addresses. The managed switching element 3805 or the L3 router (not shown) used by the managed switching element 3805 performs load balancing to distribute the workload among the VMs 3840 and 3850.
In some embodiments, load balancing is achieved by translating the destination address of the packets requesting the service into different addresses of the VMs providing the service. In particular, the managed switching element 3805 or the L3 router (not shown) used by the managed switching element 3805 translates the destination addresses of the request packets into addresses of the several VMs 3840 and 3850 such that no particular VM receives a disproportionate share of the workload. More details about finding the current workload of the service-providing VMs will be described further below.
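As a concrete, deliberately simplified illustration of this destination-address translation, the sketch below rewrites a packet addressed to a virtual IP so that it targets the least-loaded server VM. The VIP, the server addresses, and the load metric are assumptions made up for the example.

```python
# A minimal sketch of DNAT-based load balancing: the logical router owns a virtual
# IP (VIP) and rewrites the destination of request packets to the least-loaded
# server VM. Addresses and the load metric are illustrative.

VIP = "203.0.113.10"

server_load = {          # server VM IP -> current workload (hypothetical metric)
    "10.0.2.11": 3,
    "10.0.2.12": 1,      # least loaded, will be picked next
}

def dnat_load_balance(packet):
    """Rewrite the destination IP of a request to the VIP and return the packet."""
    if packet["dst_ip"] == VIP:
        chosen = min(server_load, key=server_load.get)   # pick the least-loaded VM
        packet["dst_ip"] = chosen
        server_load[chosen] += 1                          # account for the new flow
    return packet

print(dnat_load_balance({"src_ip": "10.0.1.5", "dst_ip": VIP}))
# {'src_ip': '10.0.1.5', 'dst_ip': '10.0.2.12'}
```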
In some embodiments, the managed switchingelement3805 or the L3 router perform anL3 routing3865 after performing DNAT and load balancing3860 of the logical processing pipeline. Therefore, the managed switchingelement3805 or the L3 router route the packets to different managed switching elements based on the translated destination addresses in these embodiments. The managedswitching elements3820 and3825 are edge switching elements and thus send and receive packets to and from theVMs3840 and3850 directly. In other embodiments, the managed switchingelement3805 or the L3 router performs theL3 routing3865 before performing DNAT and load balancing3860 of the logical processing pipeline.
An example operation of the managed switching element 3805 will now be described. The managed switching element 3810 receives a packet requesting a service collectively provided by the VMs 3840 and 3850. This packet comes from one of the VMs 3830, specifically from an application that uses a particular protocol. The packet in this example includes a protocol number that identifies the particular protocol. The packet also includes, as the destination IP address, an IP address that represents the server providing the service. The details of performing the source L2 processing 3855 on this packet are omitted for simplicity of description because this processing is similar to the source L2 processing examples described above and below.
After the source L2 processing 3855 is performed, the packet is routed to the managed switching element 3805 for an L3 processing that includes the L3 routing 3865. In this example, the managed switching element 3805 performs the DNAT and load balancing 3860 on the packet. That is, the managed switching element 3805 translates the destination IP address of the packet into an IP address of one of the VMs that provides the service. In this example, the managed switching element 3805 selects the one of VMs 3840-3850 that has the least workload among the VMs 3840-3850. The managed switching element 3805 performs the L3 routing 3865 on the packet (i.e., routes the packet) based on the new destination IP address.
The managed switching element 3820 receives the packet because the destination IP address is that of one of the VMs 3840 and this destination IP address is resolved into the MAC address of that VM. The managed switching element 3820 forwards the packet to the VM. This VM will return packets to the application that originally requested the service. These returning packets will reach the managed switching element 3805, and the managed switching element 3805 will perform the reverse address translation and identify that the application is the destination of these packets.
FIG. 39 illustrates another example physical implementation of logical switches and a logical router that performs load balancing. In particular, this figure illustrates a distributed L3 routing model in which the logical router is implemented by a managed switching element that also performs source and destination L2 processing. That is, this managed switching element performs the entire logical processing pipeline. This figure illustrates managed switchingelements3905 and3820-3825 andVMs3910 and3840-3850. This figure also illustrates a logical processing pipeline that includes theL2 processing3855, the DNAT and load balancing3860, theL3 routing3865, and the L2 processing3870-3875.
The managedswitching element3905 of some embodiments is similar to the managed switchingelements2505 described above by reference toFIG. 29 in that the managed switchingelement3905 implements the entire logical processing pipeline. That is, the managed switchingelement3905 implements the logical router and logical switches. In addition, the managed switchingelement3905 performs DNAT and load balancing3860 to translate a destination address into another address and balance the load among different machines (e.g., VMs) that provide the same service (e.g., a web service).
As mentioned above, the managed switchingelement3905 implements logical switches (not shown) to whichVMs3910 and3840-3850 are connected. The managedswitching element3905 also performs a load balancing to distribute workload among theVMs3840 and3850. In particular, the managed switchingelement3905 translates the destination addresses of the request packets into addresses of theseveral VMs3840 and3850 such that no particular VM of the VMs gets too much more workload than the other VMs do. More details about finding current workload of the service-providing VMs will be described further below.
In some embodiments, the managed switchingelement3905 performs anL3 routing3865 after performing DNAT and load balancing3860 of the logical processing pipeline. Therefore, the managed switchingelement3905 routes the packets to different managed switching elements based on the translated destination addresses. The managedswitching elements3820 and3825 are edge switching elements and thus send and receive packets to and from theVMs3840 and3850 directly. In other embodiments, the managed switchingelement3905 performs theL3 routing3865 before performing DNAT and load balancing3860 of the logical processing pipeline.
The operation of the managed switching element 3905 would be similar to the example operation described above by reference to FIG. 38, except that the managed switching element 3905 performs the entire logical processing pipeline, including the DNAT and load balancing 3860.
FIG. 40 illustrates yet another example physical implementation of logical switches and a logical router that performs load balancing. In particular, this figure illustrates a distributed L3 routing model in which the logical router is implemented by a managed switching element that also performs source L2 processing. That is, this managed switching element as a first-hop managed switching element performs the source L2 processing and the L3 processing. The destination L2 processing is performed by another managed switching element that is a last-hop managed switching element. This figure illustrates managed switchingelements4005 and3820-3825 andVMs4010 and3840-3850. This figure also illustrates a logical processing pipeline that includes theL2 processing3855, the DNAT and load balancing3860, theL3 routing3865, and the L2 processing3870-3875.
The managed switching element 4005 of some embodiments is similar to the managed switching element 2505 described above by reference to FIG. 46 in that the managed switching element 4005 performs the source L2 processing and the L3 processing of the logical processing pipeline. That is, the managed switching element 4005 implements the logical router and a logical switch that is connected to a source machine. In addition, the managed switching element 4005 performs DNAT and load balancing 3860 to translate a destination address into another address and balance the load among different machines (e.g., VMs) that provide the same service (e.g., a web service).
As mentioned above, the managed switchingelement4005 implements a logical switch (not shown) to which one or more ofVMs4010 are connected. The managedswitching element4005 also performs a load balancing to distribute workload among theVMs3840 and3850. In particular, the managed switchingelement4005 translates the destination addresses of the request packets into addresses of theseveral VMs3840 and3850 such that no particular VM of the VMs gets too much more workload than the other VMs do. More details about finding the current workload of the service-providing VMs will be described further below.
In some embodiments, the managed switchingelement4005 performs anL3 routing3865 after performing DNAT and load balancing3860 of the logical processing pipeline. Therefore, the managed switchingelement4005 routes the packets to different managed switching elements based on the translated destination addresses. The managedswitching elements3820 and3825 are edge switching elements and thus send and receive packets to and from theVMs3840 and3850 directly. In other embodiments, the managed switchingelement4005 performs theL3 routing3865 before performing DNAT and load balancing3860 of the logical processing pipeline.
The operation of the managed switching element 4005 would be similar to the example operation described above by reference to FIG. 38, except that different managed switching elements perform different portions of the logical processing pipeline.
FIG. 41 conceptually illustrates a load balancing daemon that balances the load among the machines that collectively provide a service (e.g., a web service). Specifically, this figure illustrates that a managed switching element that runs a logical processing pipeline to logically forward and route packets uses a load balancing daemon for balancing the workload among the machines providing the service. This figure illustrates a host 4100, a managed switching element 4105, a forwarding table 4120, a load balancing daemon 4110, and a connection table 4115 in the top half of the figure. This figure also illustrates flow entries 4125 and 4130.
The flow entries 4125 and 4130 each have a qualifier and an action. The text illustrated as flow entries 4125 and 4130 may not be in an actual format. Rather, the text is just a conceptual illustration of a qualifier and action pair. The host 4100, in some embodiments, is a machine operated by an operating system (e.g., Windows™ and Linux™) that is capable of running a set of software applications. The managed switching element 4105 of some embodiments is a software switching element (e.g., Open vSwitch) that executes in the host 4100. As mentioned above, a controller cluster (not shown) configures a managed switching element by supplying flow entries that specify the functionality of the managed switching element. The managed switching element 4105 of some embodiments does not itself generate flow entries.
The managedswitching element4105 of some embodiments runs all or part of the logical processing pipeline described above by reference toFIGS. 38-40. In particular, the managed switchingelement4105 performs the L3 processing to route packets received from the machines if necessary, based on flow entries in the forwarding table4120. In some embodiments, the managed switchingelement4105 is an edge switching element that receives a packet from a machine (not shown) that is coupled to the managed switching element. In some such embodiments, one or more virtual machines (not shown) are running in thehost4100 and are coupled to the managed switchingelements4105.
When the managed switchingelement4105 is configured to perform load balancing, the managed switchingelement4105 of some embodiments uses theload balancing daemon4110 for performing load balancing on packets. Theload balancing daemon4110 is similar to theNAT daemon3110 in that theload balancing daemon4110 provides a translated destination address (e.g., a destination IP address). In addition, theload balancing daemon4110 selects a destination into which to translate the original destination address based on the current load of the machines, the IP addresses of which are included in the table4115.
Theload balancing daemon4110 of some embodiments is a software application running on thehost4100. Theload balancing daemon4110 maintains the connection table4115 which includes pairings of connection identifiers and available addresses of the machines that provide the service. Though not depicted, the connection table4115 of some embodiments may also include the current workload quantified for a machine associated with an address. In some embodiments, theload balancing daemon4110 periodically communicates with the VMs providing the service to get the updated state of the VMs, including the current workload on the VMs.
When the managed switching element 4105 asks for an address to select based on connection identifiers, the load balancing daemon, in some embodiments, looks up the table 4115 to find the address into which the given destination address should be translated. In some embodiments, the load balancing daemon runs a scheduling method to identify a server VM in order to balance the load among the server VMs. Such a scheduling algorithm considers the current load on the machine associated with the address. More details and examples of load balancing methods are described in the U.S. Provisional Patent Application 61/560,279, which is incorporated herein by reference.
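The following sketch illustrates, under simplified assumptions, the lookup-then-schedule behavior described above: a known connection keeps its previously chosen server, while a new connection is assigned by a least-load rule. The least-load rule is a placeholder and is not the scheduling method of the cited provisional application.

```python
# A minimal sketch of the daemon's lookup-then-schedule behavior. The load metric
# and scheduling rule are illustrative placeholders.

connection_table = {}                                 # connection_id -> chosen server IP
current_load = {"10.0.2.11": 5, "10.0.2.12": 2}       # server IP -> current workload

def pick_address(connection_id):
    if connection_id in connection_table:             # existing connection: keep its server
        return connection_table[connection_id]
    server = min(current_load, key=current_load.get)  # simple least-load schedule
    connection_table[connection_id] = server
    current_load[server] += 1
    return server

conn = ("10.0.1.5", 49152, "203.0.113.10", 80, "TCP")
print(pick_address(conn))   # 10.0.2.12 (least loaded)
print(pick_address(conn))   # 10.0.2.12 again: same connection, same server
```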
The connection identifiers uniquely identify a connection between the requester of the service (i.e., the origin or source of the packet) and the machine that ends up providing the requested service so that the packets returning from the machine can be accurately relayed back to the requester. The source IP addresses of these returning packets will be translated back to an IP address (referred to as “virtual IP address”) that represents a server providing the service. The mapping between these connection identifiers will also be used for the packets that are subsequently sent from the source. In some embodiments, the connection identifiers include a source port, a destination port, a source IP address, a destination IP address, a protocol identifier, etc. The source port is a port from which the packet was sent (e.g., a TCP port). The destination port is a port to which the packet is to be sent. The protocol identifier identifies the type of protocol (e.g., TCP, UDP, etc.) used for formatting the packet.
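To make the role of these connection identifiers concrete, the sketch below models them as a hashable five-tuple key and shows the reverse translation that restores the virtual IP as the source address of returning packets. The field names and addresses are illustrative assumptions.

```python
# A sketch of connection identifiers as a five-tuple key, plus the reverse
# translation applied to returning packets so the client keeps seeing the VIP.

from collections import namedtuple

ConnKey = namedtuple("ConnKey", "src_ip src_port dst_ip dst_port proto")

VIP = "203.0.113.10"
forward_map = {}                 # ConnKey (as sent by the client) -> chosen server IP

def record_forward(key, server_ip):
    forward_map[key] = server_ip

def reverse_translate(packet_src_ip, key):
    """For a returning packet, restore the VIP as the source address."""
    if forward_map.get(key) == packet_src_ip:
        return VIP               # client sees the virtual IP, not the real server
    return packet_src_ip

key = ConnKey("10.0.1.5", 49152, VIP, 80, "TCP")
record_forward(key, "10.0.2.12")
print(reverse_translate("10.0.2.12", key))   # 203.0.113.10
```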
The managed switching element 4105 and the load balancing daemon 4110 of different embodiments use different techniques to ask for and supply addresses. For instance, the managed switching element 4105 of some embodiments sends a packet, which has an original address but does not have a translated address, to the load balancing daemon. The load balancing daemon 4110 of these embodiments translates the original address into a translated address. The load balancing daemon 4110 sends the packet back to the managed switching element 4105, which will perform logical forwarding and/or routing to send the packet towards the destination machine. In some embodiments, the managed switching element 4105 initially sends metadata, along with the packet that contains an original address to resolve, to the load balancing daemon 4110. This metadata includes information (e.g., register values, logical pipeline state, etc.) that the managed switching element 4105 uses to resume performing the logical processing pipeline when the managed switching element 4105 receives the packet back from the load balancing daemon 4110.
In other embodiments, the managed switching element 4105 requests an address by sending a flow template, which is a flow entry that does not have actual values for the addresses, to the load balancing daemon 4110. The load balancing daemon finds the addresses with which to fill in the flow template by looking up the table 4115. The load balancing daemon 4110 then sends the flow template that is filled in with actual addresses back to the managed switching element 4105 by putting the filled-in flow template into the forwarding table 4120. In some embodiments, the load balancing daemon assigns to the filled-in flow template a priority value that is higher than the priority value of the flow template that is not filled in. Moreover, when the load balancing daemon 4110 fails to find a translated address, the load balancing daemon specifies in the flow template that the packet is to be dropped.
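The flow-template exchange described in this and the preceding paragraph can be sketched as follows; the entry format, the priority bump, and the drop action are conceptual stand-ins rather than actual Open vSwitch flow syntax.

```python
# A minimal sketch of the flow-template exchange: the switching element installs a
# template with a wildcarded address; the daemon fills in the address and installs
# the result at a higher priority (or a drop action if no address can be found).

forwarding_table = []                                  # list of conceptual flow entries

def install(entry):
    forwarding_table.append(entry)

def fill_template(template, resolved_ip):
    filled = dict(template)
    if resolved_ip is None:
        filled["action"] = "drop"                      # no translation found
    else:
        filled["action"] = f"set_dst_ip:{resolved_ip}"
    filled["priority"] = template["priority"] + 1      # filled entry wins over the template
    install(filled)

template = {"match": "conn_id=42,dst_ip=VIP", "action": "send_to_daemon", "priority": 100}
install(template)
fill_template(template, "10.0.2.12")

best = max((e for e in forwarding_table if e["match"] == template["match"]),
           key=lambda e: e["priority"])
print(best["action"])    # set_dst_ip:10.0.2.12
```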
An example operation of the managed switching element 4105 and the load balancing daemon 4110 will now be described in terms of three different stages 1-3 (encircled 1-3). In this example, the managed switching element 4105 is a managed edge switching element that receives a packet to forward and route from a machine (not shown). In particular, the packet in this example is a request for a service. The packet has an IP address that represents a server that provides the requested service.
The managed switching element 4105 receives this packet and performs the L3 processing based on the flow entries in the forwarding table 4120. While performing the L3 processing 210 on the packet, the managed switching element 4105 (at stage 1) identifies the flow entry 4125 and performs the action specified in the flow entry 4125. As shown, the flow entry 4125 indicates that a flow template having connection identifiers should be sent to the load balancing daemon 4110 to have the load balancing daemon 4110 provide a new destination IP address. In this example, the flow entry 4125 has a priority value of N, which is a number in some embodiments.
At stage 2, the load balancing daemon 4110 receives the flow template and determines, by looking up the connection table 4115 and by running a scheduling algorithm, that the destination IP address of a packet that has the specified connection identifiers is to be translated into 2.1.1.10. The load balancing daemon fills out the flow template and inserts the filled-in template (now the flow entry 4130) into the forwarding table 4120. In this example, the load balancing daemon assigns a priority of N+1 to the filled-in template.
At stage 3, the managed switching element 4105 uses the flow entry 4130 to change the destination IP address of the packet. Also, for the packets that the managed switching element 4105 subsequently processes, the managed switching element 4105 uses the flow entry 4130 over the flow entry 4125 when a packet has the specified connection identifiers.
In some embodiments, theload balancing daemon4110 and the managed switching element run in a same virtual machine that is running on thehost4100 or in different virtual machines running on thehost4100. Theload balancing daemon4110 and the managed switching element may also run in separate hosts.
V. DHCP
The virtualization application, in some embodiments, defines forwarding rules that route DHCP requests to a DHCP daemon that is running in a shared host. Using a shared host for this functionality avoids the extra cost of running a DHCP daemon per customer.
FIG. 42 illustrates a DHCP daemon that provides DHCP service to different logical networks for different users. This figure illustrates, in its left half, the implementation of example logical networks 4201 and 4202 for two different users A and B, respectively. An example physical implementation of the logical networks 4201 and 4202 is illustrated in the right half of the figure.
As shown in the left half of the figure, the logical network4201 includes alogical router4205 and twological switches4210 and4215.VMs4220 and4225 are connected to thelogical switch4210. That is,VMs4220 and4225 send and receive packets forwarded by thelogical switch4210.VM4230 is connected to thelogical switch4215. Thelogical router4205 routes packets between thelogical switches4210 and4215. Thelogical router4205 is also connected to aDHCP Daemon4206 which provides DHCP service to the VMs in the logical network4201, which are VMs of the user A.
The logical network4202 for the user B includes alogical router4235 and twological switches4240 and4245.VMs4250 and4255 are connected to thelogical switch4240.VM4260 is connected to thelogical switch4245. Thelogical router4235 routes packets between thelogical switches4240 and4245. Thelogical router4235 is also connected to aDHCP Daemon4236 which provides DHCP service to the VMs in the logical network4202, which are VMs of the user B.
In the logical implementation shown in the left half of the figure, each logical network for a user has its own DHCP daemon. In some embodiments, theDHCP daemons4206 and4236 may be physically implemented as separate DHCP daemons running in different hosts or VMs. That is, each user would have a separate DHCP daemon for the user's machines only.
In other embodiments, the DHCP daemons for different users may be physically implemented as a single DHCP daemon that provides DHCP service to the VMs of the different users. That is, different users share the same DHCP daemon. The DHCP daemon 4270 is such a shared DHCP daemon that serves the VMs of both users A and B. As shown in the right half of the figure, the managed switching elements 4275-4285 that implement the logical routers 4205 and 4235 and the logical switches 4210, 4215, 4240 and 4245 for users A and B use the single DHCP daemon 4270. Therefore, the VMs 4220-4260 of the users A and B use the DHCP daemon 4270 to dynamically obtain an address (e.g., an IP address).
TheDHCP daemon4270 of different embodiments may run in different hosts. For instance, theDHCP daemon4270 of some embodiments runs in the same host (not shown) in which one of the managed switching elements4275-4285 is running. In other embodiments, theDHCP daemon4270 does not run in a host on which a managed switching element is running and instead runs in a separate host that is accessible by the managed switching elements.
FIG. 43 illustrates a central DHCP daemon and several local DHCP daemons. The central DHCP daemon provides DHCP service to VMs of different users through the local DHCP daemons. Each local DHCP daemon maintains and manages a batch of addresses in order to offload the central DHCP daemon's service to the local DHCP daemons. This figure illustrates an example architecture that includes a central DHCP daemon 4320 and two local DHCP daemons 4330 and 4350.
As shown, the central DHCP daemon 4320 runs in a host 4305 in which a managed switching element 4306 also runs. The managed switching element 4306 of some embodiments is a second-level managed switching element functioning as a pool node for the managed switching elements 4340 and 4360. The central DHCP daemon 4320 provides DHCP services to different VMs 4345 and 4365 of different users. In some embodiments, the central DHCP daemon 4320 distributes the available addresses (e.g., IP addresses) 4325 in batches of addresses to different local DHCP daemons, including the local DHCP daemons 4330 and 4350, in order to offload the DHCP service to these local DHCP daemons. The central DHCP daemon 4320 provides more addresses to a local DHCP daemon when the local DHCP daemon runs out of available addresses to assign in its own batch of addresses.
The local DHCP daemon 4330 runs in a host 4310 in which a managed switching element 4340 also runs. The managed switching element 4340 is an edge switching element that directly sends and receives packets to and from the VMs 4345. The managed switching element 4340 implements one or more logical switches and logical routers of different users. That is, the VMs 4345 may belong to different users. The local DHCP daemon 4330 provides DHCP service to the VMs 4345 using the batch of addresses 4335 that the local DHCP daemon 4330 obtains from the central DHCP daemon 4320. The local DHCP daemon 4330 resorts to the central DHCP daemon 4320 when the local DHCP daemon 4330 runs out of available addresses to assign in the batch of addresses 4335. In some embodiments, the local DHCP daemon 4330 communicates with the central DHCP daemon 4320 via the managed switching elements 4340 and 4306. The managed switching elements 4340 and 4306 have a tunnel established between them in some embodiments.
Similarly, thelocal DHCP daemon4350 runs in ahost4315 in which a managedswitching element4360 also runs. The managedswitching element4360 is an edge switching element that directly sends and receives packets to and fromVMs4365. The managedswitching element4360 implements one or more logical switches and logical routers of different users. Thelocal DHCP daemon4350 provides DHCP service toVMs4365 using the batch ofaddresses4355 that thelocal DHCP daemon4350 obtains from thecentral DHCP daemon4320. In some embodiments, the batch ofaddresses4355 does not include addresses that are in the batch ofaddresses4335 that are allocated to the local DHCP daemon running in the host4310. Thelocal DHCP daemon4350 also resorts to thecentral DHCP daemon4320 when thelocal DHCP daemon4350 runs out of available addresses to assign in its own batch ofaddresses4355. In some embodiments, thelocal DHCP daemon4350 communicates with thecentral DHCP daemon4320 via the managed switchingelements4360 and4306. The managedswitching elements4360 and4306 have a tunnel established between them in some embodiments.
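A minimal sketch of this batched delegation, assuming an arbitrary address range and batch size, is shown below: the central daemon hands out disjoint blocks, and a local daemon leases from its block and asks for another block only when it runs out.

```python
# A minimal sketch of batched address delegation between a central DHCP daemon and
# local DHCP daemons. Block size and the address range are illustrative choices.

import ipaddress

class CentralDhcp:
    def __init__(self, cidr, batch_size=4):
        self.free = list(ipaddress.ip_network(cidr).hosts())
        self.batch_size = batch_size

    def allocate_batch(self):
        batch, self.free = self.free[:self.batch_size], self.free[self.batch_size:]
        return batch

class LocalDhcp:
    def __init__(self, central):
        self.central = central
        self.batch = []
        self.leases = {}                  # vm_id -> address

    def lease(self, vm_id):
        if not self.batch:                # out of addresses: resort to the central daemon
            self.batch = self.central.allocate_batch()
        self.leases[vm_id] = self.batch.pop(0)
        return self.leases[vm_id]

central = CentralDhcp("192.0.2.0/28")
local_a, local_b = LocalDhcp(central), LocalDhcp(central)
print(local_a.lease("vm4345-1"), local_b.lease("vm4365-1"))   # addresses from disjoint batches
```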
VI. Interposing Service VMs
In the discussion above, various L3 services that are provided by the virtualization application of some embodiments were described. To maximize the network control system's flexibility, some embodiments interpose service machines that provide similar functionality to those provided by the “middleboxes” that users use today in the physical networks.
Accordingly, the network control system of some embodiments includes at least one “middlebox” VM that is attached to a LDPS of a logical network. Then the pipeline state of the LDP sets is programmed by the control application (that populates the logical control plane) so that the relevant packets are forwarded to the logical port of this VM. After the VM has processed the packet, the packet is sent back to the logical network so that its forwarding continues through the logical network. In some embodiments, the network control system utilizes many such “middlebox” VMs. The middlebox VMs interposed in this manner may be very stateful and implement features well beyond the L3 services described in this document.
VII. Scalability
The scalability implications of the logical L3 switching design of some embodiments along three dimensions are addressed below. These three dimensions are: (1) logical state, (2) physical tunneling state, and (3) distributed binding lookups. Most of the logical pipeline processing occurs at the first hop. This implies that all the logical (table) state, of all interconnected LDP sets, is disseminated, in some embodiments, to everywhere in the network where the pipeline execution may take place. In other words, the combined logical state of all interconnected LDP sets is disseminated to every managed edge switching element attached to any of these LDP sets in some embodiments. However, in some embodiments, the “meshiness” of the logical topology does not increase the dissemination load of the logical state.
To limit the state dissemination, some embodiments balance the pipeline execution between the source and destination devices so that the last LDPS pipeline would be executed not at the first hop but at the last hop. However, in some cases, this may result in not disseminating enough state for every managed switching element to do the logical forwarding decision of the last LDPS; without that state, the source managed switching elements might not even be able to deliver the packets to the destination managed switching elements. Accordingly, some embodiments will constrain the general LDPS model, in order to balance the pipeline execution between the source and destination devices.
The logical state itself is not likely to contain more than at most O(N) entries (N is the total number of logical ports in the interconnected LDP sets) as the logical control plane is designed, in some embodiments, to mimic the physical control planes that are used today, and the physical control planes are limited by the capabilities of existing hardware switching chipsets. Therefore, disseminating the logical state might not be the primary bottleneck of the system but eventually it might become one, as the logical control plane design grows.
Some embodiments partition the managed switching elements of a network into cliques interconnected by higher-level aggregation switching elements. Instead of implementing partitioning to reduce logical state with an “everything on the first-hop” model, some embodiments partition to reduce the tunneling state, as discussed below. Examples of cliques are described in the above-mentioned U.S. patent application Ser. No. 13/177,535. This application also describes various embodiments that perform all or most of the logical data processing at the first-hop, managed switching elements.
The physical tunneling state maintained in the whole system is O(N²), where N is the total number of logical ports in the interconnected LDP sets. This is because any managed edge switching element with a logical port has to be able to directly send the traffic to the destination managed edge switching element. Therefore, maintaining the tunneling state in an efficient manner, without imposing an O(N²) load on any centralized control element, becomes even more important than with pure L2 LDP sets. The aggregation switching elements are used, in some embodiments, to slice the network into cliques. In some of these embodiments, the packet is still logically routed all the way in the source managed edge switching element, but instead of tunneling it directly to the destination edge switching element, it is sent to a pool node that routes it towards the destination based on the destination MAC address. In essence, the last L2 LDPS spans multiple cliques, and pool nodes are used to stitch together portions of that L2 domain.
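A rough back-of-the-envelope sketch of why cliques help is given below; the clique formula (intra-clique mesh plus tunnels to pool nodes) is an illustrative approximation, not a result stated in this document.

```python
# Illustrative comparison of tunnel counts: a full mesh among N edge switching
# elements versus K cliques of N/K members stitched by pool nodes.

def full_mesh_tunnels(n):
    return n * (n - 1) // 2

def clique_tunnels(n, cliques, pool_nodes_per_clique=1):
    members = n // cliques
    intra = cliques * (members * (members - 1) // 2)   # mesh inside each clique
    to_pool = n * pool_nodes_per_clique                # tunnels to the pool nodes
    return intra + to_pool

print(full_mesh_tunnels(1000))     # 499500
print(clique_tunnels(1000, 10))    # 50500: roughly an order of magnitude fewer
```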
FIGS. 44-45B illustrate a distributed logical router implemented in several managed switching elements based on flow entries of the managed switching elements. In particular, FIGS. 44-45B illustrate that some of the destination L2 processing is performed by a last-hop managed switching element (i.e., the switching element that sends a packet directly to a destination machine).
FIG. 44 conceptually illustrates an example of performing some logical processing at the last hop switching element. Specifically,FIG. 44 illustrates that the managed switchingelement2505 that is coupled to a source machine for a packet performs most of thelogical processing pipeline200 and the managed switchingelement2510 that is coupled to a destination machine performs some of thelogical processing pipeline200. The figure illustrates thelogical router225 and thelogical switches220 and230 in the left half of the figure. This figure illustrates the managed switchingelements2505 and2510 in the right half of the figure. The figure illustrates VMs 1-4 in both the right and the left halves of the figure.
In some embodiments, a managed switching element does not keep all the information (e.g., flow entries in lookup tables) to perform the entirelogical processing pipeline200. For instance, the managed switching element of these embodiments does not maintain the information for determining access control with respect to a logical port of the destination logical network through which to send the packet to the destination machine of the packet.
An example packet flow along the managed switchingelements2505 and2510 will now be described. WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 4 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement2505. The managedswitching element2505 then performs theL2 processing205 and theL3 processing210 on the packet.
The managedswitching element2505 then performs a portion of theL2 processing215. Specifically, the managed switchingelement2505 determines access control for the packet. For instance, the managed switchingelement2505 determines that the packet does not have network addresses (e.g., source/destination MAC/IP addresses, etc.) that will cause thelogical switch230 to reject the packet that came through port Y of thelogical switch230. The managedswitching element2505 then determines thatport 1 of thelogical switch230 is the port through which to send the packet out to the destination,VM 4. However, the managed switchingelement2505 does not determine access control for the packet with respect toport 1 of thelogical switch230 because the managed switchingelement2505, in some embodiments, does not have information (e.g., flow entries) to perform theegress ACL2670.
The managedswitching element2505 then performs a mapping lookup to determine a physical port to which thelogical port 1 of thelogical switch230 is mapped. In this example, the managed switchingelement2505 determines thatlogical port 1 of thelogical switch230 is mapped toport 5 of the managed switchingelement2510. The managedswitching element2505 then performs a physical lookup to determine operations for forwarding the packet to the physical port. In this example, the managed switchingelement2505 determines that the packet should be sent toVM 4 throughport 5 of the managed switchingelement2510. The managedswitching element2505 in this example modifies the logical context of the packet before sending it out along with the packet toVM 4.
The managedswitching element2505 sends the packet to the managed switchingelement2510. In some cases, the managed switchingelement2505 sends the packet over the tunnel that is established between the managed switchingelements2505 and2510 (e.g., the tunnel that terminates atport 3 of the managed switchingelement2505 andport 3 of the managed switching element2510). When the tunnel is not available, the managed switchingelements2505 sends the packet to a pool node (not shown) so that the packet can reach the managed switchingelement2510.
When the managed switchingelement2510 receives the packet, the managed switchingelement2510 performs theegress ACL2670 on the packet based on the logical context of the packet (the logical context would indicate that it is theegress ACL2670 that is left to be performed on the packet). For instance, the managed switchingelement2510 determines that the packet does not have network addresses that will cause thelogical switch230 not to send the packet through theport 1 of thelogical switch230. The managedswitching element2510 then sends the packet toVM 4 throughport 5 of the managed switchingelement2510 as determined by the managed switchingelement2505 that performed theL2 processing215.
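The following sketch illustrates, under assumed stage names and a placeholder ACL check, how a last-hop switching element can use the logical context carried in the packet header to resume the pipeline at the remaining stage (here the egress ACL) rather than rerunning the whole pipeline.

```python
# A minimal sketch of resuming the pipeline from the logical context carried in the
# packet header. Stage names and the ACL policy are illustrative placeholders.

PIPELINE = ["ingress_acl", "l2_forward", "egress_acl", "physical_mapping"]

def egress_acl(packet):
    return packet.get("dst_mac") != "blocked"          # placeholder policy

def resume_pipeline(packet):
    start = PIPELINE.index(packet["logical_context"]["next_stage"])
    for stage in PIPELINE[start:]:                     # only the remaining stages run
        if stage == "egress_acl" and not egress_acl(packet):
            return "drop"
    return "deliver via port 5"

pkt = {"dst_mac": "00:aa:bb:cc:dd:04",
       "logical_context": {"next_stage": "egress_acl"}}
print(resume_pipeline(pkt))    # deliver via port 5
```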
FIGS. 45A-45B conceptually illustrate an example operation of the logical switches 220 and 230, the logical router 225, and the managed switching elements 2505 and 2510 described above by reference to FIG. 44. Specifically, FIG. 45A illustrates an operation of the managed switching element 2505, which implements the logical router 225, the logical switch 220, and a portion of the logical switch 230. FIG. 45B illustrates an operation of the managed switching element 2510 that implements a portion of the logical switch 230.
As shown in the bottom half ofFIG. 45A, the managed switchingelement2505 includesL2 entries4505 and4515 andL3 entries4510. These entries are flow entries that a controller cluster (not shown) supplies to the managed switchingelement2505. Although these entries are depicted as three separate tables, the tables do not necessarily have to be separate tables. That is, a single table may include all these flow entries.
WhenVM 1 that is coupled to thelogical switch220 sends apacket4530 toVM 4 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement2505 throughport 4 of the managed switchingelement2505. The managedswitching element2505 performs an L2 processing on the packet based on the forwarding tables4505-4515 of the managed switchingelement2505. In this example, thepacket4530 has a destination IP address of 1.1.2.10, which is the IP address ofVM 4. Thepacket4530's source IP address is 1.1.1.10. Thepacket4530 also hasVM 1's MAC address as a source MAC address and the MAC address of the logical port 1 (e.g., 01:01:01:01:01:01) of thelogical router225 as a destination MAC address.
The operation of the managed switching element 2505, up to the point at which the managed switching element identifies a record indicated by an encircled 9 and performs the L2 logical processing 2665, is similar to the operation of the managed switching element 2505 in the example of FIG. 30A, except that in the example of FIG. 45A the operation is performed on the packet 4530.
Based on the logical context and/or other fields stored in thepacket4530's header, the managed switchingelement2505 then identifies a record indicated by an encircled 10 (referred to as “record 10”) in theL2 entries4515 that implements the context mapping of thestage2675. In this example, therecord 10 identifiesport 5 of the managed switchingelement2510 to whichVM 4 is coupled as the port that corresponds to the logical port (determined at stage2665) of thelogical switch230 to which thepacket4530 is to be forwarded. Therecord 10 additionally specifies that thepacket4530 be further processed by the forwarding tables (e.g., by sending thepacket4530 to a dispatch port).
Based on the logical context and/or other fields stored in thepacket4530's header, the managed switchingelement2505 then identifies a record indicated by an encircled 11 (referred to as “record 11”) in theL2 entries4515 that implements the physical mapping of thestage2680. The record 11 specifiesport 3 of the managed switchingelement2505 as a port through which thepacket4530 is to be sent in order for thepacket4530 to reach the managed switchingelement2510. In this case, the managed switchingelement2505 is to send thepacket4530 out ofport 3 of managed switchingelement2505 that is coupled to the managed switchingelement2510.
As shown in FIG. 45B, the managed switching element 2510 includes a forwarding table that includes rules (e.g., flow entries) for processing and routing the packet 4530. When the managed switching element 2510 receives the packet 4530 from the managed switching element 2505, the managed switching element 2510 begins processing the packet 4530 based on the forwarding tables of the managed switching element 2510. The managed switching element 2510 identifies a record indicated by an encircled 1 (referred to as "record 1") in the forwarding tables that implements the context mapping. The record 1 identifies the packet 4530's logical context based on the logical context that is stored in the packet 4530's header. The logical context specifies that the packet 4530 has been processed up to the stage 2665 by the managed switching element 2505. As such, the record 1 specifies that the packet 4530 be further processed by the forwarding tables (e.g., by sending the packet 4530 to a dispatch port).
Next, the managed switchingelement2510 identifies, based on the logical context and/or other fields stored in thepacket4530's header, a record indicated by an encircled 2 (referred to as “record 2”) in the forwarding tables that implements the egress ACL. In this example, therecord 2 allows thepacket4530 to be further processed and, thus, specifies thepacket4530 be further processed by the forwarding tables (e.g., by sending thepacket4530 to a dispatch port). In addition, therecord 2 specifies that the managed switchingelement2510 store the logical context (i.e., thepacket4530 has been processed for L2 egress ACL of the logical switch230) of thepacket4530 in the set of fields of thepacket4530's header.
Next, the managed switchingelement2510 identifies, based on the logical context and/or other fields stored in thepacket4530's header, a record indicated by an encircled 3 (referred to as “record 3”) in the forwarding tables that implements the physical mapping. Therecord 3 specifies theport 5 of the managed switchingelement2510 through which thepacket4530 is to be sent in order for thepacket4530 to reachVM 4. In this case, the managed switchingelement2510 is to send thepacket4530 out ofport 5 of managed switchingelement2510 that is coupled toVM 4. In some embodiments, the managed switchingelement2510 removes the logical context from thepacket4530 before sending the packet toVM 4.
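The record-by-record walkthrough above follows a resubmit pattern: each matching record acts on the packet and then sends it to a dispatch port for further table lookups until a physical port is chosen. The sketch below models that loop with simplified stand-in records.

```python
# A minimal sketch of the "send to a dispatch port" pattern: records act on the
# packet and resubmit it until one emits it through a physical port. The records
# are simplified stand-ins for flow entries, not actual flow syntax.

def context_mapping(pkt):
    pkt["stage"] = "egress_acl"

def egress_acl(pkt):
    pkt["stage"] = "physical_mapping"

def physical_mapping(pkt):
    pkt["out_port"] = 5            # port that reaches the destination VM
    pkt["stage"] = None

RECORDS = {"context_mapping": context_mapping,
           "egress_acl": egress_acl,
           "physical_mapping": physical_mapping}

def process(pkt):
    pkt["stage"] = "context_mapping"
    while pkt["stage"] is not None:        # resubmit via the dispatch port
        RECORDS[pkt["stage"]](pkt)
    return pkt["out_port"]

print(process({"dst": "VM4"}))             # 5
```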
FIGS. 46-47B illustrate a distributed logical router implemented in several managed switching elements based on flow entries of the managed switching elements. In particular,FIGS. 46-47B illustrate that thesource L2 processing205 andL3 processing210 are performed by a first hop managed switching element (i.e., the switching element that receives a packet directly from a source machine) and the entiredestination L2 processing215 is performed by a last hop managed switching element (i.e., the switching element that sends a packet directly to a destination machine).
FIG. 46 conceptually illustrates an example of performing some logical processing at the last hop switching element. Specifically,FIG. 46 illustrates that the managed switchingelement2505 that is coupled to a source machine for a packet performs theL2 processing205 and theL3 processing210 and the managed switchingelement2510 that is coupled to a destination machine performs theL2 processing215. That is, the managed switchingelement2505 performs L2 forwarding for the source logical network and the L3 routing and the L2 forwarding for the destination logical network is performed by the managed switchingelement2510. The figure illustrates thelogical router225 and thelogical switches220 and230 in the left half of the figure. This figure illustrates the managed switchingelements2505 and2510 in the right half of the figure. The figure illustrates VMs 1-4 in both the right and the left halves of the figure.
In some embodiments, a managed switching element does not keep all the information (e.g., flow entries in lookup tables) to perform the entirelogical processing pipeline200. For instance, the managed switching element of these embodiments does not maintain the information for performing logical forwarding for the destination logical network on the packet.
An example packet flow along the managed switchingelements2505 and2510 will now be described. WhenVM 1 that is coupled to thelogical switch220 sends a packet toVM 4 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement2505. The managedswitching element2505 then performs theL2 processing205 and theL3 processing210 on the packet.
The managedswitching element2505 sends the packet to the managed switchingelement2510. In some cases, the managed switchingelement2505 sends the packet over the tunnel that is established between the managed switchingelements2505 and2510 (e.g., the tunnel that terminates atport 3 of the managed switchingelement2505 andport 3 of the managed switching element2510). When the tunnel is not available, the managed switchingelements2505 sends the packet to a pool node (not shown) so that the packet can reach the managed switchingelement2510.
When the managed switching element 2510 receives the packet, the managed switching element 2510 performs the L2 processing 215 on the packet based on the logical context of the packet (the logical context would indicate that it is the entire L2 processing 215 that is left to be performed on the packet). The managed switching element 2510 then sends the packet to VM 4 through port 5 of the managed switching element 2510.
FIGS. 47A-47B conceptually illustrate an example operation of the logical switches 220 and 230, the logical router 225, and the managed switching elements 2505 and 2510 described above by reference to FIG. 46. Specifically, FIG. 47A illustrates an operation of the managed switching element 2505, which implements the logical switch 220 and the logical router 225. FIG. 47B illustrates an operation of the managed switching element 2510 that implements the logical switch 230.
As shown in the bottom half ofFIG. 47A, the managed switchingelement2505 includesL2 entries4705 andL3 entries4710. These entries are flow entries that a controller cluster (not shown) supplies to the managed switchingelement2505. Although these entries are depicted as two separate tables, the tables do not necessarily have to be separate tables. That is, a single table may include all these flow entries.
WhenVM 1 that is coupled to thelogical switch220 sends apacket4730 toVM 4 that is coupled to thelogical switch230, the packet is first sent to the managed switchingelement2505 throughport 4 of the managed switchingelement2505. The managedswitching element2505 performs an L2 processing on the packet based on the forwarding tables4705-4710 of the managed switchingelement2505. In this example, thepacket4730 has a destination IP address of 1.1.2.10, which is the IP address ofVM 4. Thepacket4730's source IP address is 1.1.1.10. Thepacket4730 also hasVM 1's MAC address as a source MAC address and the MAC address of the logical port 1 (e.g., 01:01:01:01:01:01) of thelogical router225 as a destination MAC address.
The operation of the managed switching element 2505, up to the point at which the managed switching element identifies a record indicated by an encircled 7 and performs the L3 egress ACL with respect to the port 2 of the logical router 225, is similar to the operation of the managed switching element 2505 in the example of FIG. 30A, except that in the example of FIG. 47A the operation is performed on the packet 4730.
Based on the logical context and/or other fields stored in thepacket4730's header, the managed switchingelement2505 then identifies a record indicated by an encircled 8 (referred to as “record 8”) in theL2 entries4710 that implements the physical mapping of thestage2680. Therecord 8 specifies that thelogical switch230 is implemented in the managed switchingelement2510 and the packet should be sent to the managed switchingelement2510.
Based on the logical context and/or other fields stored in thepacket4730's header, the managed switchingelement2505 then identifies a record indicated by an encircled 9 (referred to as “record 9”) in the L2 entries4715 that implements the physical mapping of thestage2680. Therecord 9 specifiesport 3 of the managed switchingelement2505 as a port through which thepacket4730 is to be sent in order for thepacket4730 to reach the managed switchingelement2510. In this case, the managed switchingelement2505 is to send thepacket4730 out ofport 3 of managed switchingelement2505 that is coupled to the managed switchingelement2510.
As shown in FIG. 47B, the managed switching element 2510 includes a forwarding table that includes rules (e.g., flow entries) for processing and routing the packet 4730. When the managed switching element 2510 receives the packet 4730 from the managed switching element 2505, the managed switching element 2510 begins processing the packet 4730 based on the forwarding tables of the managed switching element 2510. The managed switching element 2510 identifies a record indicated by an encircled 1 (referred to as "record 1") in the forwarding tables that implements the context mapping. The record 1 identifies the packet 4730's logical context based on the logical context that is stored in the packet 4730's header. The logical context specifies that the L2 processing 205 and the L3 processing 210 have been performed on the packet 4730 by the managed switching element 2505. The record 1 specifies that the packet 4730 be further processed by the forwarding tables (e.g., by sending the packet 4730 to a dispatch port).
Based on the logical context and/or other fields stored in thepacket4730's header, the managed switchingelement2510 identifies a record indicated by an encircled 2 (referred to as “record 2”) in the L2 forwarding table that implements the L2 ingress ACL. In this example, therecord 2 allows thepacket4730 to come through the logical port Y of the logical switch230 (not shown) and, thus, specifies thepacket4730 be further processed by the managed switching element2510 (e.g., by sending thepacket4730 to a dispatch port). In addition, therecord 2 specifies that the managed switchingelement2510 store the logical context (i.e., thepacket4730 has been processed by the stage4762 of the processing pipeline4700) of thepacket4730 in the set of fields of thepacket4730's header.
Next, the managed switchingelement2510 identifies, based on the logical context and/or other fields stored in thepacket4730's header, a record indicated by an encircled 3 (referred to as “record 3”) in the L2 forwarding table that implements the logical L2 forwarding. Therecord 3 specifies that a packet with the MAC address ofVM 4 as destination MAC address should be forwarded through alogical port 2 of thelogical switch230 that is connected toVM 4.
The record 3 also specifies that the packet 4730 be further processed by the forwarding tables (e.g., by sending the packet 4730 to a dispatch port). Also, the record 3 specifies that the managed switching element 2510 store the logical context (i.e., the packet 4730 has been processed by the stage 4766 of the processing pipeline 4700) in the set of fields of the packet 4730's header.
Next, the managed switchingelement2510 identifies, based on the logical context and/or other fields stored in thepacket4730's header, a record indicated by an encircled 4 (referred to as “record 4”) in the forwarding tables that implements the egress ACL. In this example, therecord 4 allows thepacket4730 to be further processed and, thus, specifies thepacket4730 be further processed by the forwarding tables (e.g., by sending thepacket4730 to a dispatch port). In addition, therecord 4 specifies that the managed switchingelement2510 store the logical context (i.e., thepacket4730 has been processed for L2 egress ACL of the logical switch230) of thepacket4730 in the set of fields of thepacket4730's header.
Based on the logical context and/or other fields stored in the packet 4730's header, the managed switching element 2510 then identifies a record indicated by an encircled 5 (referred to as "record 5") in the L2 entries 4715 that implements the context mapping. In this example, the record 5 identifies port 5 of the managed switching element 2510, to which VM 4 is coupled, as the port that corresponds to the logical port 2 of the logical switch 230 to which the packet 4730 is to be forwarded. The record 5 additionally specifies that the packet 4730 be further processed by the forwarding tables (e.g., by sending the packet 4730 to a dispatch port).
Next, the managed switchingelement2510 identifies, based on the logical context and/or other fields stored in thepacket4730's header, a record indicated by an encircled 6 (referred to as “record 6”) in the forwarding tables that implements the physical mapping. Therecord 6 specifies theport 5 of the managed switchingelement2510 through which thepacket4730 is to be sent in order for thepacket4730 to reachVM 4. In this case, the managed switchingelement2510 is to send thepacket4730 out ofport 5 of managed switchingelement2510 that is coupled toVM 4. In some embodiments, the managed switchingelement2510 removes the logical context from thepacket4730 before sending the packet toVM 4.
The execution of all the pipelines on the logical path of a packet has implications to the distributed lookups, namely ARP and learning. As the lookups can now be executed by any edge switching element having a logical port attached to the logical network, the total volume of the lookups is going to exceed the lookups executed on a similar physical topology; even though the packet would head towards the same port, differing senders cannot share the cached lookup state, as the lookups will be initiated on different managed edge switching elements. Hence, the problems of flooding are amplified by the logical topology and a unicast mapping based approach for lookups is preferred in practice.
By sending a special lookup packet towards a cloud of mapping servers (e.g., pool or root nodes), the source edge switching element can do the necessary lookups without resorting to flooding. In some embodiments, the mapping server benefits from heavy traffic aggregate locality (and hence good cache hit ratios on client side) as well as from datapath-only implementation resulting in excellent throughput.
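The following sketch illustrates this flood-free lookup pattern under simplified assumptions: the edge switching element issues a unicast query to a mapping server and caches the binding. The server interface and cache policy are hypothetical.

```python
# A minimal sketch of flood-free binding lookups: the source edge switching element
# sends a unicast query to a mapping server (e.g., a pool node) and caches the answer.

class MappingServer:
    def __init__(self, bindings):
        self.bindings = bindings                 # IP -> (MAC, location)

    def lookup(self, ip):
        return self.bindings.get(ip)

class EdgeSwitchingElement:
    def __init__(self, mapping_server):
        self.server = mapping_server
        self.cache = {}

    def resolve(self, ip):
        if ip not in self.cache:                 # cache miss: one unicast lookup, no flooding
            self.cache[ip] = self.server.lookup(ip)
        return self.cache[ip]

server = MappingServer({"1.1.2.10": ("00:aa:bb:cc:dd:04", "host2510")})
edge = EdgeSwitchingElement(server)
print(edge.resolve("1.1.2.10"))                  # ('00:aa:bb:cc:dd:04', 'host2510')
```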
FIG. 48 conceptually illustrates an example software architecture of a host 4800 on which a managed switching element runs. Specifically, this figure illustrates that the host 4800 also runs an L3 daemon that resolves an L3 address (e.g., an IP address) into an L2 address (e.g., a MAC address) for a packet that the L3 daemon receives from the managed switching element. This figure illustrates that the host 4800 includes a managed switching element 4805, a forwarding table 4820, an L3 daemon 4810, and a mapping table 4815 in the top half of the figure. This figure also illustrates flow entries 4825 and 4830.
The flow entries 4825 and 4830 each have a qualifier and an action. The text illustrated as flow entries 4825 and 4830 may not be in an actual format. Rather, the text is just a conceptual illustration of a qualifier and action pair. In some embodiments, flow entries have priorities, and a managed switching element takes the action of the flow entry with the highest priority when the qualifiers of more than one flow entry are satisfied.
The host 4800, in some embodiments, is a machine operated by an operating system (e.g., Windows™ and Linux™) that is capable of running a set of software applications. The managed switching element 4805 of some embodiments is a software switching element (e.g., Open vSwitch) that executes in the host 4800. As mentioned above, a controller cluster (not shown) configures a managed switching element by supplying flow entries that specify the functionality of the managed switching element. The managed switching element 4805 of some embodiments does not itself generate flow entries or ARP requests.
The managedswitching element4805 of some embodiments runs all or part of thelogical processing pipeline200 described above. In particular, the managed switchingelement4805 is a managed switching element (e.g., the managed switchingelements1720 or2505) that performs theL3 processing210 to route packets received from the machines as necessary, based on flow entries in the forwarding table4820. In some embodiments, the managed switchingelement4805 is an edge switching element that receives a packet from a machine (not shown) that is coupled to the managed switching element. In some such embodiments, one or more virtual machines (not shown) are running in thehost4800 and are coupled to the managed switchingelements4805. In other embodiments, the managed switching element is a second-level managed switching element.
When the managed switchingelement4805 receives a packet that is the very first packet being sent to a destination machine that is in another logical network (or the packet itself is an ARP request), the managed switchingelement4805 of these embodiments would not yet know the MAC address of the destination machine. In other words, the managed switchingelement4805 would not know the mapping between the next-hop IP address and the destination MAC address. In order to resolve the next-hop IP address into the destination MAC address, the managed switchingelement4805 of some embodiments requests the destination MAC address of the packet from theL3 daemon4810.
The L3 daemon 4810 of some embodiments is a software application running on the host 4800. The L3 daemon 4810 maintains the mapping table 4815, which includes mappings of IP and MAC addresses. When the managed switching element 4805 asks for a destination MAC address that corresponds to a next-hop IP address, the L3 daemon looks up the mapping table 4815 to find the destination MAC address to which the next-hop IP address is mapped. (In some cases, the destination MAC address to which the next-hop IP address is mapped is the MAC address of the next-hop logical router.)
The managed switching element 4805 and the L3 daemon 4810 of different embodiments use different techniques to ask for and supply addresses. For instance, the managed switching element 4805 of some embodiments sends a packet, which has a destination IP address but does not have a destination MAC address, to the L3 daemon. The L3 daemon 4810 of these embodiments resolves the IP address into a destination MAC address. The L3 daemon 4810 sends the packet back to the managed switching element 4805, which will perform logical forwarding and/or routing to send the packet towards the destination machine. In some embodiments, the managed switching element 4805 initially sends metadata, along with the packet that contains a destination IP address to resolve, to the L3 daemon 4810. This metadata includes information (e.g., register values, logical pipeline state, etc.) that the managed switching element 4805 uses to resume performing the logical processing pipeline when the managed switching element 4805 receives the packet back from the L3 daemon 4810.
In other embodiments, the managed switching element 4805 requests a destination address by sending a flow template, which is a flow entry that does not have an actual value for the destination MAC address, to the L3 daemon 4810. The L3 daemon 4810 finds the destination MAC address to fill in the flow template by looking up the mapping table 4815. The L3 daemon 4810 then sends the flow template that is filled in with the actual destination MAC address back to the managed switching element 4805 by putting the filled-in flow template into the forwarding table 4820. In some embodiments, the L3 daemon assigns the filled-in flow template a priority value that is higher than the priority value of the flow template that is not filled in.
When the mapping table 4815 has an entry for the destination IP address and the entry has the destination MAC address mapped to the destination IP address, the L3 daemon 4810 writes the destination MAC address into the packet or fills in the flow template with it. When there is no such entry, the L3 daemon generates an ARP request and broadcasts the ARP packet to other hosts or VMs that run L3 daemons. In particular, the L3 daemon of some embodiments only sends the ARP requests to those hosts or VMs to which the next-hop logical L3 router may be attached. The L3 daemon receives a response to the ARP packet that contains the destination MAC address from one of the hosts or VMs that received the ARP packet. The L3 daemon 4810 maps the destination IP address to the destination MAC address and adds this mapping to the mapping table 4815. In some embodiments, the L3 daemon 4810 periodically sends a unicast packet to the L3 daemon that responded to the ARP request in order to check the validity of the destination MAC address. In this manner, the L3 daemon 4810 keeps the IP and MAC address mappings up to date.
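The lookup-then-ARP behavior described in the preceding paragraphs can be summarized in a short sketch. The following Python fragment is purely illustrative; the class, the arp_client helper, and the refresh interval are assumptions introduced here and are not part of any embodiment.

    # Illustrative sketch of an L3 daemon's address resolution (hypothetical names).
    import time

    class L3DaemonSketch:
        def __init__(self, arp_client, refresh_interval=60):
            self.mapping_table = {}           # next-hop IP -> (MAC, last verified)
            self.arp_client = arp_client      # assumed transport for ARP traffic
            self.refresh_interval = refresh_interval

        def resolve(self, next_hop_ip):
            """Return a MAC for next_hop_ip, consulting the local table first."""
            entry = self.mapping_table.get(next_hop_ip)
            if entry is not None:
                return entry[0]
            # No local entry: ARP only toward hosts that may attach the next hop.
            mac = self.arp_client.request(next_hop_ip)
            if mac is None:
                return None                   # caller drops the packet
            self.mapping_table[next_hop_ip] = (mac, time.time())
            return mac

        def revalidate(self, next_hop_ip):
            """Periodically unicast to the responder to keep a mapping fresh."""
            entry = self.mapping_table.get(next_hop_ip)
            if entry and time.time() - entry[1] > self.refresh_interval:
                if not self.arp_client.unicast_probe(next_hop_ip, entry[0]):
                    del self.mapping_table[next_hop_ip]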
In some embodiments, when the L3 daemon 4810 still fails to find a resolved address after looking up the flow entries and sending ARP requests to other L3 daemon instances, the L3 daemon specifies in the flow template that the packet should be dropped, or the L3 daemon itself drops the packet.
When the managed switching element 4805 receives an ARP packet from another host or VM, the managed switching element 4805 of some embodiments does not forward the ARP packet to the machines that are coupled to the managed switching element. The managed switching element 4805 in these embodiments sends the ARP packet to the L3 daemon. The L3 daemon maintains in the mapping table 4815 the mappings between IP addresses and MAC addresses that are locally available (e.g., the IP addresses and MAC addresses of the machines that are coupled to the managed switching element 4805). When the mapping table 4815 has an entry for the IP address of the received ARP packet and the entry has the MAC address of a VM that is coupled to the managed switching element 4805, the L3 daemon sends the MAC address, in the response to the ARP packet, to the host or VM (i.e., the L3 daemon of the host or VM) from which the ARP packet originated.
An example operation of the managed switching element 4805 and the L3 daemon 4810 will now be described in terms of three different stages 1-3 (encircled 1-3). In this example, the managed switching element 4805 is a managed edge switching element that receives a packet to forward and route from a machine (not shown). The managed switching element 4805 receives a packet and performs the logical processing 200 based on the flow entries in the forwarding table 4820.
When the packet is the very first packet that bears the IP address of the destination machine, or the packet is an ARP request from a source machine, the managed switching element 4805 (at stage 1) identifies the flow entry 4825 and performs the action specified in the flow entry 4825. As shown, the flow entry 4825 indicates that a flow template having a destination IP address 1.1.2.10 to be resolved to a destination MAC X should be sent to the L3 daemon 4810. In this example, the flow entry 4825 has a priority value of N, which is a number in some embodiments.
At stage 2, the L3 daemon 4810 receives the flow template and finds out that 1.1.2.10 is to be resolved to 01:01:01:01:01:09 by looking up the mapping table 4815. The L3 daemon fills out the flow template and inserts the filled-in template (now the flow entry 4830) into the forwarding table 4820. In this example, the L3 daemon assigns a priority of N+1 to the filled-in template.
At stage 3, the managed switching element 4805, in some embodiments, uses the flow entry 4830 to set the destination MAC address for the packet. Also, for the packets that the managed switching element 4805 subsequently processes, the managed switching element 4805 uses the flow entry 4830 over the flow entry 4825 when a packet has the destination IP address of 1.1.2.10.
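The priority relationship between the unfilled flow template (priority N) and the filled-in entry (priority N+1) can be pictured with a small sketch. The data layout below is invented for illustration and does not reflect an actual flow entry format:

    # A forwarding table that picks the highest-priority matching entry, so the
    # filled-in template (priority N+1) wins over the unfilled template (priority N).
    from collections import namedtuple

    FlowEntry = namedtuple("FlowEntry", ["priority", "match", "action"])

    N = 10
    forwarding_table = [
        FlowEntry(N,     {"dst_ip": "1.1.2.10"}, ("send_to_l3_daemon",)),
        FlowEntry(N + 1, {"dst_ip": "1.1.2.10"}, ("set_dst_mac", "01:01:01:01:01:09")),
    ]

    def lookup(packet):
        matches = [f for f in forwarding_table
                   if all(packet.get(k) == v for k, v in f.match.items())]
        return max(matches, key=lambda f: f.priority) if matches else None

    print(lookup({"dst_ip": "1.1.2.10"}).action)   # ('set_dst_mac', '01:01:01:01:01:09')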
In some embodiments, the L3 daemon 4810 and the managed switching element run in the same virtual machine that is running on the host 4800 or in different virtual machines running on the host 4800. In some embodiments, the L3 daemon 4810 runs in the user space of a virtual machine. The L3 daemon 4810 and the managed switching element may also run in separate hosts.
In some embodiments, the managed switching element 4805 does not rely on the L3 daemon 4810 to resolve addresses. In some such embodiments, the control cluster (not shown in FIG. 48) may statically configure the flow entries 4820 such that the flow entries 4820 include the mappings between IP addresses and MAC addresses obtained through API calls (i.e., inputs) or DHCP.
FIG. 49 conceptually illustrates a process 4900 that some embodiments perform to resolve network addresses. In some embodiments, the process 4900 is performed by a managed switching element that performs the L3 processing 210 to route packets at L3 (e.g., the managed switching elements 1720, 2505, or 3105). The process 4900, in some embodiments, starts when the process receives a packet that is to be logically routed at L3.
The process 4900 begins by determining (at 4905) whether the packet needs address resolution (e.g., resolving a destination IP address to a destination MAC address). In some embodiments, the process makes this determination based on a flow entry: the flow entry whose qualifier matches the information stored in the packet's header or logical context specifies that the packet needs address resolution.
When the process 4900 determines (at 4905) that the packet does not need address resolution, the process ends. Otherwise, the process 4900 determines (at 4910) whether the process 4900 needs to request, from an L3 daemon, an address into which to resolve a packet's address (e.g., destination IP address). In some embodiments, the process 4900 determines whether the process needs to ask the L3 daemon based on the flow entry. For instance, the flow entry may specify that the address into which to resolve the packet's address should be obtained by requesting the resolved address from the L3 daemon. In some embodiments, the process determines that the L3 daemon should provide the resolved address when the flow entry is a flow template that has an empty field for the resolved address or some other value in that field indicating that the resolved address should be obtained from the L3 daemon.
When the process determines (at 4910) that the process does not need to request an address from the L3 daemon, the process obtains (at 4920) the resolved address from the flow entry. For instance, the flow entry would provide the translated address. The process then proceeds to 4925, which will be described further below. When the process determines (at 4910) that the process needs to request an address from the L3 daemon, the process 4900 at 4915 requests and obtains the resolved address from the L3 daemon. In some embodiments, the process 4900 requests the resolved address by sending a flow template to the L3 daemon. The L3 daemon would fill the flow template with the resolved address and place that filled-in flow template in the forwarding table (not shown) that the process uses.
Next, the process 4900 modifies (at 4925) the packet with the resolved address. In some embodiments, the process modifies an address field in the header of the packet. Alternatively or conjunctively, the process modifies the logical context to replace the packet's address with the resolved address. The process then ends.
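The operations 4905-4925 of the process 4900 can be restated compactly as follows. The dictionary-based flow entry and the helper names are assumptions made only for this sketch:

    # Sketch of process 4900: decide whether and how to obtain a resolved address.
    def process_4900(packet, flow_entry, l3_daemon):
        if not flow_entry.get("needs_resolution"):            # 4905
            return packet                                      # no resolution needed
        if flow_entry.get("resolved_mac") is not None:         # 4910
            mac = flow_entry["resolved_mac"]                   # 4920: taken from the entry
        else:
            mac = l3_daemon.resolve(packet["dst_ip"])          # 4915: ask the L3 daemon
        packet["dst_mac"] = mac                                # 4925: modify the packet
        return packet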
FIG. 50 illustrates a network architecture 5000 of some embodiments. Specifically, this figure illustrates a map server that allows several hosts (or VMs) that each run an L3 daemon to avoid broadcasting ARP requests. This figure illustrates a set of hosts (or VMs) including 5005, 5010, and 5015.
The hosts 5010 and 5015 are similar to the host 4800 described above by reference to FIG. 48 in that each of the hosts 5010 and 5015 runs an L3 daemon, a managed switching element, and one or more VMs.
The host 5005 runs a map server. The map server 5005 of some embodiments maintains a global mapping table 5020 that includes all the entries of all the mapping tables maintained by the L3 daemons running in every host in the network that runs a managed edge switching element. In some embodiments, an L3 daemon in the network sends the entries that map locally available IP addresses to MAC addresses. Whenever there is a change to the machines coupled to a managed switching element of a host (e.g., when a VM fails or is coupled to or de-coupled from the managed switching element), the L3 daemon of the host updates the respective local mapping table accordingly and also sends the updates (e.g., by sending a special “publish” packet containing the updates) to the map server 5005 so that the map server 5005 keeps the global mapping table 5020 updated with the change.
In some embodiments, the L3 daemon running in each host that runs a managed edge switching element does not broadcast an ARP packet when the local mapping table does not have an entry for a destination IP address to resolve. Instead, the L3 daemon consults the map server 5005 to resolve the destination IP address into the destination MAC address. The map server 5005 resolves the destination IP address into a destination MAC address by looking up the global mapping table 5020. In the case that the map server 5005 cannot resolve the IP address (e.g., when the global mapping table 5020 does not have an entry for the IP address or the map server 5005 fails), the L3 daemon will resort to broadcasting an ARP packet to other hosts that run managed edge switching elements. In some embodiments, the map server 5005 is implemented in the same host or VM in which a second-level managed switching element (e.g., a pool node) is implemented.
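The division of labor between the map server 5005 and the per-host L3 daemons can be sketched as follows; the class and method names are hypothetical and stand in for the special ‘publish’ and ‘query’ packet exchange described above:

    # Illustrative map server holding the global IP-to-MAC table 5020.
    class MapServerSketch:
        def __init__(self):
            self.global_table = {}                  # IP -> MAC, aggregated from all hosts

        def publish(self, host_id, updates):
            """An L3 daemon pushes its locally available mappings."""
            self.global_table.update(updates)

        def query(self, ip):
            """Resolve on behalf of an L3 daemon; None triggers the ARP fallback."""
            return self.global_table.get(ip)

    server = MapServerSketch()
    server.publish("host-5010", {"1.1.2.10": "01:01:01:01:01:09"})
    assert server.query("1.1.2.10") == "01:01:01:01:01:09"
    assert server.query("1.1.3.7") is None           # caller falls back to broadcasting ARP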
FIG. 51 illustrates a process 5100 that some embodiments perform to maintain a mapping table that includes mappings of IP and MAC addresses. In some embodiments, the process 5100 is performed by an L3 daemon that requests resolved addresses from a mapping server. The mapping server in these embodiments maintains a global mapping table that includes mappings of IP and MAC addresses for a set of managed switching elements. The process 5100, in some embodiments, starts when the process receives a particular address to resolve from a managed switching element.
The process begins by determining (at 5105) whether the process has a resolved address for the particular address received from the managed switching element. In some embodiments, the process looks up a local mapping table that includes mappings of IP and MAC addresses to determine whether the process has a resolved address for the particular address.
When the process 5100 determines that the process has a resolved address, the process proceeds to 5120, which will be described further below. Otherwise, the process 5100 requests and obtains a resolved address from the map server. The process 5100 then modifies (at 5115) the local mapping table with the resolved address obtained from the mapping server. In some embodiments, the process 5100 inserts a new mapping of the resolved address and the particular address into the local mapping table.
The process 5100 then sends (at 5120) the resolved address to the managed switching element. In some embodiments, the process 5100 modifies the packet that has the particular address. In other embodiments, the process 5100 modifies the flow template that the managed switching element had sent as a request for the resolved address. The process then ends.
FIG. 52 illustrates a process 5200 that some embodiments perform to maintain a mapping table that includes mappings of IP and MAC addresses. In some embodiments, the process 5200 is performed by an L3 daemon that maintains a local mapping table and sends updates to a mapping server. The mapping server in these embodiments maintains a global mapping table that includes mappings of IP and MAC addresses for a set of managed switching elements. The process 5200, in some embodiments, starts when the L3 daemon starts running.
The process 5200 begins by monitoring (at 5205) a set of managed switching elements. In particular, the process 5200 monitors for coupling and decoupling of machines to and from a managed switching element or any address change for the machines coupled to a managed switching element. In some embodiments, the set of managed switching elements includes those managed switching elements that are running on the same host or virtual machine on which the L3 daemon is running.
Next, the process 5200 determines (at 5210) whether there has been such a change to a managed switching element that the process monitors. When the process determines (at 5210) that there has not been a change, the process 5200 loops back to 5205 to keep monitoring the set of managed switching elements. Otherwise, the process modifies (at 5215) the corresponding entries in the local mapping table. For instance, when a VM migrates and gets coupled to one of the managed switching elements in the set, the process inserts a mapping of the IP address and the MAC address of the migrated VM into the local mapping table.
The process 5200 then sends the updated mapping to the map server so that the map server can update the global mapping table with the new and/or modified mapping of the IP address and MAC address. The process then ends.
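Taken together, the operations 5205-5215 and the final publish step amount to a monitor-update-publish loop. The event objects and their attributes in the sketch below are assumptions introduced for illustration:

    # Sketch of process 5200: keep the local table current and push updates upstream.
    def process_5200(events, local_table, map_server, host_id):
        for event in events:                                  # 5205/5210: monitor for changes
            if event["kind"] == "vm_attached":
                local_table[event["ip"]] = event["mac"]       # 5215: modify local mapping table
            elif event["kind"] == "vm_detached":
                local_table.pop(event["ip"], None)
            map_server.publish(host_id, dict(local_table))    # send the updated mapping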
VIII. Flow Generation and Flow Processing
As described above, the managed switching elements of some embodiments implement logical switches and logical routers based on flow tables supplied to the managed switching elements by the controller cluster (one or more controller instances) of some embodiments. In some embodiments, the controller cluster generates these flow entries by performing table mapping operations based on the inputs or network events the controller cluster detects. Details of these controller clusters and their operations are described in U.S. patent application Ser. No. 13/177,533, and in the above-incorporated U.S. patent application.
As mentioned in U.S. patent application Ser. No. 13/589,077, the network control system in some embodiments is a distributed control system that includes several controller instances that allow the system to accept logical datapath sets from users and to configure the switching elements to implement these logical datapath sets. In some embodiments, one type of controller instance is a device (e.g., a general-purpose computer) that executes one or more modules that transform the user input from a logical control plane to a logical forwarding plane, and then transform the logical forwarding plane data to physical control plane data. These modules in some embodiments include a control module and a virtualization module. A control module allows a user to specify and populate a logical datapath set, while a virtualization module implements the specified logical datapath set by mapping the logical datapath set onto the physical switching infrastructure. In some embodiments, the control and virtualization applications are two separate applications, while in other embodiments they are part of the same application.
From the logical forwarding plane data for a particular logical datapath set, the virtualization module of some embodiments generates universal physical control plane (UPCP) data that is generic for any managed switching element that implements the logical datapath set. In some embodiments, this virtualization module is part of a controller instance that is a master controller for the particular logical datapath set. This controller is referred to as the logical controller.
In some embodiments, the UPCP data is then converted to customized physical control plane (CPCP) data for each particular managed switching element by a controller instance that is a master physical controller instance for the particular managed switching element, or by a chassis controller for the particular managed switching element, as further described in U.S. patent application Ser. No. 13/589,077. When the chassis controller generates the CPCP data, the chassis controller obtains the UPCP data from the virtualization module of the logical controller through the physical controller.
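The chain of conversions described here, LCP data to LFP data at the control module, LFP data to UPCP data at the virtualization module, and UPCP data to CPCP data at a physical or chassis controller, can be pictured as a composition of translation steps. The functions below are placeholders for the actual table mapping operations, not an implementation of them:

    # Conceptual pipeline of forwarding state through the controller layers.
    def lcp_to_lfp(lcp):
        return {"lfp": lcp}                         # control module

    def lfp_to_upcp(lfp):
        return {"upcp": lfp}                        # virtualization module (switch-generic)

    def upcp_to_cpcp(upcp, switch_id):
        return {"cpcp": upcp, "switch": switch_id}  # physical or chassis controller

    def push_to_switches(lcp, switch_ids):
        upcp = lfp_to_upcp(lcp_to_lfp(lcp))
        return [upcp_to_cpcp(upcp, s) for s in switch_ids]   # one CPCP set per switch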
Irrespective of whether the physical controller or chassis controller generates the CPCP data, the CPCP data for a particular managed switching element needs to be propagated to the managed switching element. In some embodiments, the CPCP data is propagated through a network information base (NIB) data structure, which in some embodiments is an object-oriented data structure. Several examples of using the NIB data structure are described in U.S. patent application Ser. Nos. 13/177,529 and 13/177,533, which are incorporated herein by reference. As described in these applications, the NIB data structure is also used in some embodiments to serve as a communication medium between different controller instances, and to store data regarding the logical datapath sets (e.g., logical switching elements) and/or the managed switching elements that implement these logical datapath sets.
However, other embodiments do not use the NIB data structure to propagate CPCP data from the physical controllers or chassis controllers to the managed switching elements, to communicate between controller instances, or to store data regarding the logical datapath sets and/or managed switching elements. For instance, in some embodiments, the physical controllers and/or chassis controllers communicate with the managed switching elements through OpenFlow entries and updates over the configuration protocol. Also, in some embodiments, the controller instances use one or more direct communication channels (e.g., RPC calls) to exchange data. In addition, in some embodiments, the controller instances (e.g., the control and virtualization modules of these instances) express the logical and/or physical data in terms of records that are written into the relational database data structure. In some embodiments, this relational database data structure is part of the input and output tables of a table mapping engine (called nLog) that is used to implement one or more modules of the controller instances.
FIG. 53 conceptually illustrates three controller instances of a controller cluster of some embodiments. These three controller instances include a logical controller 5300 for generating UPCP data from logical control plane (LCP) data received as API calls, and physical controllers 5390 and 5330 for customizing the UPCP data specific to the managed switching elements 5320 and 5325, respectively. Specifically, the logical controller 5300 of some embodiments generates universal flows by performing table mapping operations on tables using a table mapping processor (not shown) such as nLog. An nLog engine is described in U.S. patent application Ser. No. 13/177,533. This figure also illustrates a user 5325 and managed switching elements 5320 and 5325.
As shown, the logical controller 5300 includes a control application 5305 and a virtualization application 5310. In some embodiments, the control application 5305 is used to receive the logical control plane data, and to convert this data to logical forwarding plane data that is then supplied to the virtualization application 5310. The virtualization application 5310 generates universal physical control plane data from logical forwarding plane data.
In some embodiments, some of the logical control plane data are converted from the inputs. In some embodiments, the logical controller 5300 supports a set of API calls. The logical controller has an input translation application (not shown) that translates the set of API calls into LCP data. Using the API calls, the user can configure logical switches and logical routers as if the user is configuring physical switching elements and routers.
The physical controllers 5390 and 5330 are the masters of the managed switching elements 5320 and 5325, respectively. The physical controllers 5390 and 5330 of some embodiments receive the UPCP data from the logical controller 5300 and convert the UPCP data to CPCP data for the managed switching elements 5320 and 5325, respectively. The physical controller 5390 then sends the CPCP data for the managed switching element 5320 to the managed switching element 5320. The physical controller 5330 sends the CPCP data for the managed switching element 5325 to the managed switching element 5325. The CPCP data for the managed switching elements 5320 and 5325 are in the form of flow entries. The managed switching elements 5320 and 5325 then perform forwarding and routing of the packets based on the flow entries. As described in U.S. patent application Ser. No. 13/177,533, this conversion of LCP data to LFP data and then to CPCP data is performed by using an nLog engine.
Even though FIG. 53 illustrates two physical controllers generating CPCP data from UPCP data for two different managed switching elements, one of ordinary skill will realize that in other embodiments the physical controllers serve simply to relay the UPCP data to each switching element's chassis controller, which in turn generates that switching element's CPCP data and pushes this data to its switching element.
FIG. 54 illustrates an example architecture 5400 and a user interface 5405. Specifically, this figure illustrates the inputs that the user sends to a controller application in order to configure logical switches and routers in a desired way. This figure illustrates a user interface (UI) 5405 in four stages 5406-5409 in the left half of the figure. This figure also illustrates the architecture 5400, which includes a logical router 5425 and two logical switches 5420 and 5430, in the right half of the figure.
The UI 5405 is an example interface through which the user can enter inputs and receive responses from a controller instance in order to manage the logical switches and routers. In some embodiments, the UI 5405 is provided as a web application and thus can be opened up with a web browser. Alternatively or conjunctively, the control application of some embodiments may allow the user to enter and receive inputs through a command line interface.
The left half of the figure illustrates that the user enters inputs to set up logical ports in logical switches and logical routers that are to be implemented by a set of managed switching elements of the network that the controller instance manages. In particular, the user adds a logical port to a logical router, LR, by supplying (at stage 5406) the port's identifier, “RP1,” an IP address of “1.1.1.253” to associate with the port, and a net mask “255.255.255.0.” The user also adds a logical port to a logical switch, LS1, by supplying (at 5407) a port identifier, “SP1,” and specifying that the port is to be connected to the logical port RP1 of the logical router. The user also adds another logical port to the logical router LR by supplying (at stage 5408) the port's identifier, “RP2,” an IP address of “1.1.2.253” to associate with the port, and a net mask “255.255.255.0.” The user also adds another logical port to the logical switch LS2 by supplying (at 5409) a port identifier, “SP2,” and specifying that the port is to be connected to the logical port RP2 of the logical router. The right half of the figure illustrates the ports added to the logical router and logical switches.
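If the same four stages were driven through the API calls mentioned earlier rather than through the UI 5405, the sequence might resemble the following. The function names are hypothetical and merely mirror the inputs of stages 5406-5409:

    # Hypothetical API sequence mirroring stages 5406-5409.
    inputs = []

    def add_router_port(router, port, ip, netmask):
        inputs.append(("router_port", router, port, ip, netmask))

    def add_switch_port(switch, port, attach_to):
        inputs.append(("switch_port", switch, port, attach_to))

    add_router_port("LR", "RP1", "1.1.1.253", "255.255.255.0")   # stage 5406
    add_switch_port("LS1", "SP1", attach_to="RP1")               # stage 5407
    add_router_port("LR", "RP2", "1.1.2.253", "255.255.255.0")   # stage 5408
    add_switch_port("LS2", "SP2", attach_to="RP2")               # stage 5409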
FIGS. 55-62 conceptually illustrate an example operation of the control application 5305. These figures illustrate a set of tables that the control application 5305 uses and modifies in order to generate flow entries to be supplied to managed switching elements. Specifically, the managed switching elements (not shown) implement the logical ports added to the logical switches 5420 and 5430 and the logical router 5425 based on the inputs described above by reference to FIG. 54. The figure illustrates the control application 5305, the virtualization application 5310, and the physical controller 5330.
The control application 5305 as shown includes an input translation 5505, input tables 5510, a rules engine 5515, output tables 5520, and an exporter 5525.
The input translation 5505, in some embodiments, interacts with a management tool with which a user can view and/or modify a logical network state. Different embodiments provide different management tools to the user. For instance, the input translation 5505, in some embodiments, provides a graphical tool such as the UI 5405 described above by reference to FIG. 54. Instead of, or in conjunction with, a graphical tool, other embodiments may provide the user with a command-line tool or any other type of management tool. The input translation 5505 receives inputs from the user through the management tool and processes the received inputs to create, populate and/or modify one or more input tables 5510.
The input tables 5510 are similar to the input tables described in U.S. patent application Ser. No. 13/288,908, which is incorporated herein by reference. An input table in some cases represents the state of the logical switches and the logical routers that the user is managing. For instance, an input table 5530 is a table that stores IP addresses in classless inter-domain routing (CIDR) format, associated with logical ports of logical switches. The control application modifies input tables with inputs that the control application receives through the management tool or with any network events that the control application detects. After the control application 5305 modifies input tables, the control application 5305 uses the rules engine 5515 to process the modified input tables.
The rules engine 5515 of different embodiments performs different combinations of database operations on different sets of input tables to populate and/or modify different sets of output tables 5520. For instance, the rules engine 5515 modifies a table 5535 to associate a MAC address with a logical port of a logical router when the input table 5530 is changed to indicate that the logical port of the logical router is created. The output table 5565 includes flow entries that specify the actions for the managed switching elements that implement the logical switches and logical routers to perform on the network data that is being routed/forwarded. In addition to the tables 5530-5560, the rules engine 5515 may use other input tables, constants tables, and functions tables to facilitate the table mapping operation of the rules engine 5515.
The output tables may also be used as input tables to the rules engine 5515. That is, a change in the output tables may trigger another table mapping operation to be performed by the rules engine 5515. Therefore, the entries in the tables 5530-5560 may result from performing table mapping operations and may also provide inputs to the rules engine 5515 for another set of table mapping operations. As such, the input tables and the output tables are depicted in a single dotted box in this figure to indicate that the tables are input and/or output tables.
The table 5535 is for storing pairings of logical ports of logical routers and the associated MAC addresses. The table 5540 is a logical routing table for a logical router to use when routing packets. In some embodiments, the table 5540 will be sent to the managed switching element that implements the logical router. The table 5550 is for storing next-hop identifiers and IP addresses for logical ports of logical routers. The table 5555 is for storing connections between logical ports of logical switches and logical ports of logical routers. The exporter 5525 publishes or sends the modified output tables in the output tables 5520 to the virtualization application 5310.
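The feedback behavior of the rules engine 5515, where a change to one table triggers writes to other tables until nothing more changes, can be sketched as a small fixpoint loop. The rule format below is invented for illustration and is not the nLog syntax:

    # Toy table-mapping loop: rerun rules until no table changes (a fixpoint),
    # mimicking output tables feeding back in as input tables.
    def run_rules(tables, rules):
        changed = True
        while changed:
            changed = False
            for rule in rules:
                for target, row in rule(tables):
                    if row not in tables[target]:
                        tables[target].append(row)
                        changed = True
        return tables

    def assign_mac_on_new_router_port(tables):
        # When a router port appears in table 5530, pair it with a MAC in table 5535.
        for port in tables["5530"]:
            yield "5535", {"port": port["id"], "mac": "01:01:01:01:01:01"}

    tables = {"5530": [{"id": "RP1", "ip": "1.1.1.253/24"}], "5535": []}
    print(run_rules(tables, [assign_mac_on_new_router_port])["5535"])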
FIG. 55 illustrates the tables 5530-5565 before the stage 5406 described above by reference to FIG. 54. The entries in the tables are depicted as dots to indicate there are some existing entries in these tables.
FIG. 56 illustrates the tables 5530-5565 after the stage 5406. That is, this figure illustrates the tables 5530-5565 after the user supplies a logical port's identifier, “RP1,” an IP address of “1.1.1.253” to associate with the port, and a net mask “255.255.255.0” to add the logical port to the logical router 5425, identified as “LR.” Here, the table 5530 is updated with a new entry by the input translation 5505. The new entry (or row) 5601 indicates that a logical port identified as “RP1” is added and that the IP addresses associated with this port are specified by the IP address 1.1.1.253, a prefix length of 24, and the net mask 255.255.255.0.
The rules engine 5515 detects this update to the table 5530 and performs a set of table mapping operations to update the tables 5535 and 5540. FIG. 57 illustrates the result of this set of table mapping operations. Specifically, this figure illustrates that the table 5535 has a new row 5701, which indicates that the logical port RP1 is now associated with a MAC address 01:01:01:01:01:01. This MAC address is generated by the rules engine 5515 while performing the table mapping operations using other tables or functions (not shown).
FIG. 57 also illustrates that the table 5540 has a new row 5702, which is an entry in the routing table for the logical router 5425. The logical router 5425 (i.e., the managed switching element that implements the logical router 5425) will look up this table 5540 to make a routing decision. The row 5702 specifies that the next hop for the logical port RP1 has a unique identifier “NH1.” The row 5702 also includes a priority assigned to this row in the routing table. This priority is used for determining which row should be used to make a routing decision when there are multiple matching rows in the routing table. In some embodiments, the value of the priority for a row is the prefix length plus a basic priority value “BP.”
The rules engine 5515 detects the update to the table 5540 and performs a set of table mapping operations to update the table 5550. FIG. 58 illustrates the result of this set of table mapping operations. Specifically, this figure illustrates that the table 5550 has a new row 5801, which indicates that the IP address of the next hop for the logical port RP1 of the logical router 5425 is a given packet's destination IP address. (“0” in this row means that the next hop's IP is the destination of the given packet that would be routed through RP1 of the logical router.)
FIG. 59 illustrates the tables 5530-5560 after the stage 5407 described above by reference to FIG. 54. That is, this figure illustrates the tables 5530-5565 after the user supplies a logical port's identifier, “SP1,” to add the logical port to the logical switch 5420 (LS1) and links this port to the logical port RP1 of the logical router 5425. Here, the table 5555 is updated with two new rows by the input translation 5505. The new row 5901 indicates that a logical port identified as “SP1” (of the logical switch 5420) is attached to the logical port RP1 (of the logical router 5425). Also, the new row 5902 indicates that the logical port RP1 is attached to the logical port SP1. This link connects L2 processing and L3 processing portions of the logical processing pipeline 200 described above.
The rules engine 5515 detects the updates to the table 5555 and performs a set of table mapping operations to update the table 5535. FIG. 60 illustrates the result of this set of table mapping operations. Specifically, this figure illustrates that the table 5535 has a new row 6001, which indicates that the logical port SP1 is now associated with a MAC address 01:01:01:01:01:01 because SP1 and RP1 are now linked.
The rules engine 5515 detects the updates to the table 5555 and performs a set of table mapping operations to update the table 5560. FIG. 61 illustrates the result of this set of table mapping operations. Specifically, this figure illustrates that the table 5550 has four new rows (flow entries) 6101-6104. The row 6101 is a flow entry indicating that packets whose destination MAC address is 01:01:01:01:01:01 are to be sent to the logical port SP1 (of the logical switch 5420). The row 6102 is a flow entry indicating that any packet delivered to the logical port SP1 is to be sent to the logical port RP1. The row 6103 is a flow entry indicating that any packet delivered to the logical port RP1 is to be sent to the logical port SP1. The row 6104 is a flow entry indicating that a packet with an IP address that falls within the range of IP addresses specified by 1.1.1.253/24 should have its MAC address resolved by asking an L3 daemon.
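Expressed as data, the rows 6101-6104 might be rendered roughly as follows; this is a conceptual rendering rather than an actual OpenFlow encoding:

    # Conceptual rendering of the four flow entries 6101-6104.
    flow_entries_6101_6104 = [
        {"row": 6101, "match": {"dst_mac": "01:01:01:01:01:01"},
         "action": "output:SP1"},
        {"row": 6102, "match": {"ingress_port": "SP1"},
         "action": "output:RP1"},
        {"row": 6103, "match": {"ingress_port": "RP1"},
         "action": "output:SP1"},
        {"row": 6104, "match": {"ip_dst_in": "1.1.1.253/24"},
         "action": "resolve_mac_via_l3_daemon"},
    ]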
FIG. 62 illustrates new rows 6201-6209 added to some of the tables after the stages 5408 and 5409 described above. For simplicity of description, the intermediate illustration of table updates by the rules engine 5515 is omitted.
The new row 6201 indicates that a logical port identified as “RP2” is added and that the IP addresses associated with this port are specified by the IP address 1.1.2.253, a prefix length of 24, and the net mask 255.255.255.0. The new row 6202 indicates that the logical port RP2 is now associated with a MAC address 01:01:01:01:01:02. The new row 6203 indicates that the logical port SP2 is associated with a MAC address 01:01:01:01:01:02. The new row 6204 is an entry in the routing table for the logical router 5425. The row 6204 specifies that the next hop for the logical port RP2 has a unique identifier “NH2.” The row 6204 also includes a priority assigned to this row in the routing table.
The new row 6205 indicates that the IP address of the next hop for the logical port RP2 of the logical router 5425 is a given packet's destination IP address. The new row 6206 indicates that a logical port identified as “SP2” (of the logical switch 5430) is attached to the logical port RP2 (of the logical router 5425). Also, the new row 6207 indicates that the logical port RP2 is attached to the logical port SP2.
The row 6208 is a flow entry indicating that packets whose destination MAC address is 01:01:01:01:01:02 are to be sent to the logical port SP2 (of the logical switch 5430). The row 6209 is a flow entry indicating that any packet delivered to the logical port SP2 is to be sent to the logical port RP2. The row 6210 is a flow entry indicating that any packet delivered to the logical port RP2 is to be sent to the logical port SP2. The row 6211 is a flow entry indicating that a packet with an IP address that falls within the range of IP addresses specified by 1.1.2.253/24 should have its MAC address resolved by asking an L3 daemon.
These flow entries shown in FIG. 62 are LFP data. This LFP data will be sent to the virtualization application 5310, which will generate UPCP data from the LFP data. Then, the UPCP data will be sent to the physical controller 5330, which will customize the UPCP data for the managed switching element 5325 (not shown in FIG. 62). Finally, the physical controller 5330 will send the CPCP data to the managed switching element 5325.
FIG. 63 illustrates the architecture 5400 after the control application 5305 generates logical data by performing the table mapping operations as described above by reference to FIGS. 55-62. As shown in FIG. 63, the ports RP1 and RP2 are associated with ranges of IP addresses specified by 1.1.1.253/24 and 1.1.2.253/24, respectively. Also, the ports SP1 and SP2 are associated with MAC addresses 01:01:01:01:01:01 and 01:01:01:01:01:02, respectively. This figure also illustrates VM 1 that is coupled to the logical switch 5420 and VM 2 that is coupled to the logical switch 5430.
An example operation of the logical switches 5420 and 5430, the logical router 5425, and VMs 1 and 2 will now be described. This example assumes that a set of managed switching elements that implement the logical router 5425 and the logical switches 5420 and 5430 have all the flow entries 6101-6104 and 6208-6211. This example also assumes that the logical data produced by the control application 5305 are converted to physical control plane data by the virtualization application 5310 and that the physical control plane data is received by the managed switching elements and converted into physical forwarding data.
When VM 1 intends to send a packet to VM 4, VM 1 first broadcasts an ARP request to resolve the logical router 5425's MAC address. This ARP packet has a source IP address of VM 1, which is 1.1.1.10 in this example, and a destination IP address of VM 4, which is 1.1.2.10 in this example. This broadcast packet has the broadcast MAC address “ff:ff:ff:ff:ff:ff” as the destination MAC address, and the packet's target protocol address is 1.1.1.253. This broadcast packet (the ARP request) is replicated to all ports of the logical switch 5420, including the logical port SP1. Then, based on the flow entry 6102, this packet is sent to RP1 of the logical router 5425. The packet is then sent to an L3 daemon (not shown) according to the flow entry 6104 because the target protocol address 1.1.1.253 falls in the range of IP addresses specified by 1.1.1.253/24. The L3 daemon resolves the target protocol address to a MAC address 01:01:01:01:01:01, which is the MAC address of RP1. The L3 daemon sends the ARP response with this MAC address back to VM 1.
VM 1 then sends a packet to VM 4. This packet has VM 1's MAC address as the source MAC address, RP1's MAC address (01:01:01:01:01:01) as the destination MAC address, VM 1's IP address (1.1.1.10) as the source IP address, and VM 4's IP address (1.1.2.10) as the destination IP address.
The logical switch 5420 then forwards this packet to SP1 according to the flow entry 6101, which indicates that a packet with the destination MAC address of 01:01:01:01:01:01 is to be sent to SP1. When the packet reaches SP1, the packet is then sent to RP1 according to the flow entry 6102, which indicates that any packet delivered to SP1 is to be sent to RP1.
This packet is then sent to the ingress ACL stage of the logical router 5425, which in this example allows the packet to go through RP1. Then the logical router 5425 routes the packet to the next hop, NH2, according to the entry 6204. This routing decision is then loaded to a register (of the managed switching element that implements the logical router 5425). This packet is then fed into the next hop lookup process, which uses the next hop's ID, NH2, to determine the next-hop IP address and the port the packet should be sent to. In this example, the next hop is determined based on the row 6205, which indicates that NH2's address is the destination IP address of the packet and the port the packet should be sent to is RP2.
The packet is then fed into a MAC resolution process to resolve the destination IP address (1.1.2.10) to the MAC address of VM 4. The L3 daemon resolves the MAC address and puts back a new flow entry (e.g., by filling in a flow template with the resolved MAC address) into the managed switching element that implements the logical router 5425. According to this new flow entry, the packet now has VM 4's MAC address as the destination MAC address and the MAC address of RP2 (01:01:01:01:01:02) of the logical router 5425 as the source MAC address.
The packet then goes through the egress ACL stage of the logical router 5425, which in this example allows the packet to exit through RP2. The packet is then sent to SP2 according to the flow entry 6210, which indicates that any packet delivered to RP2 is to be sent to SP2. Then the L2 processing for the logical switch 5430 will send the packet to VM 4.
IX. Modification to Managed Edge Switching Element Implementation
While all of the LDPS processing is pushed to the managed edge switching elements, only the interfaces to the actual attached physical ports need to address interoperability issues in some embodiments. These interfaces, in some embodiments, implement the standard L2/L3 interface for the host IP/Ethernet stack. The interfaces between the logical switches and logical routers remain internal to the virtualization application, and hence do not need to implement exactly the same protocols as today's routers to exchange information.
The virtualization application, in some embodiments, has the responsibility to respond to the ARP requests sent to the first-hop router's IP address. Since the logical router's MAC/IP address bindings are static, this introduces no scaling issues. The last-hop logical router, in some embodiments, does not have a similar, strict requirement: as long as the MAC and IP address(es) of the attached port are made known to the virtualization application, it can publish them to the internal lookup service, which is not exposed to the endpoints but is only used by the logical pipeline execution. There is no absolute need to send ARP requests to the attached port.
Some embodiments implement the required L3 functionality as an external daemon running next to the Open vSwitch. In some embodiments, the daemon is responsible for the following operations:
- Responding to ARP requests. In some embodiments, Open vSwitch feeds ARP requests to the daemon and the daemon creates a response. Alternatively, some embodiments use flow templating to create additional flow entries in the managed edge switching elements. Flow templating is the use of a set of rules to generate a series of flow entries dynamically based on packets received. In some such embodiments, the responses are handled by the Open vSwitch itself.
- Establishing any stateful (NAT, ACL, load-balancing) per-flow state. Again, if the flow templating is flexible enough, more of this can be moved to the Open vSwitch to handle.
- Initiating the distributed lookups. Distributed lookups (e.g., ARP, learning) are initiated to the mapping service as necessary when feeding traffic through its sequence of logical pipelines. This will involve queuing of IP packets in some embodiments.
For generating ARP requests when integrating with external physical networks, some embodiments assume that the packet can be dropped to the local IP stack by using the LOCAL output port of OpenFlow.
The mapping service itself is implemented, in some embodiments, by relying on the datapath functionality of the Open vSwitch: daemons at the managed edge switching elements publish the MAC and IP address bindings by sending a special ‘publish’ packet to the mapping service nodes, which will then create flow entries using the flow templating. The ‘query’ packets from the managed edge switching elements will then be responded to by these FIB entries, which will send the packet to the special IN_PORT after modifying the query packet enough to become a response packet.
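A rough sketch of the mapping-service node's handling of the ‘publish’ and ‘query’ packets is shown below. The packet dictionaries are an invented stand-in; the actual embodiments realize this behavior with Open vSwitch flow templating rather than a general-purpose program:

    # Hedged sketch of a mapping-service node's publish/query handling.
    bindings = {}                                   # IP -> (MAC, port that published it)

    def handle_packet(pkt, in_port):
        if pkt["type"] == "publish":
            for ip, mac in pkt["bindings"].items():
                bindings[ip] = (mac, in_port)       # corresponds to installing a FIB entry
            return None
        if pkt["type"] == "query":
            entry = bindings.get(pkt["ip"])
            if entry is None:
                return None
            # Rewrite the query into a response and bounce it out of IN_PORT.
            return {"type": "response", "ip": pkt["ip"],
                    "mac": entry[0], "out_port": "IN_PORT"}
        return None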
X. Logical Switching Environment
Several embodiments described above and below provide network control systems that completely separate the logical forwarding space (i.e., the logical control and forwarding planes) from the physical forwarding space (i.e., the physical control and forwarding planes). These control systems achieve such a separation by using a mapping engine to map the logical forwarding space data to the physical forwarding space data. By completely decoupling the logical space from the physical space, the control systems of these embodiments allow the logical view of the logical forwarding elements to remain unchanged while changes are made to the physical forwarding space (e.g., virtual machines are migrated, physical switches or routers are added, etc.).
More specifically, the control system of some embodiments manages networks over which machines (e.g. virtual machines) belonging to several different users (i.e., several different users in a private or public hosted environment with multiple hosted computers and managed forwarding elements that are shared by multiple different related or unrelated users) may exchange data packets for separate LDP sets. That is, machines belonging to a particular user may exchange data with other machines belonging to the same user over a LDPS for that user, while machines belonging to a different user exchange data with each other over a different LDPS implemented on the same physical managed network. In some embodiments, a LDPS (also referred to as a logical forwarding element (e.g., logical switch, logical router), or logical network in some cases) is a logical construct that provides switching fabric to interconnect several logical ports, to which a particular user's machines (physical or virtual) may attach.
In some embodiments, the creation and use of such LDP sets and logical ports provides a logical service model that to an untrained eye may seem similar to the use of a virtual local area network (VLAN). However, various significant distinctions from the VLAN service model for segmenting a network exist. In the logical service model described herein, the physical network can change without having any effect on the user's logical view of the network (e.g., the addition of a managed switching element, or the movement of a VM from one location to another does not affect the user's view of the logical forwarding element). One of ordinary skill in the art will recognize that all of the distinctions described below may not apply to a particular managed network. Some managed networks may include all of the features described in this section, while other managed networks will include different subsets of these features.
In order for the managed forwarding elements within the managed network of some embodiments to identify the LDPS to which a packet belongs, the network controller clusters automatedly generate flow entries for the physical managed forwarding elements according to user input defining the LDP sets. When packets from a machine on a particular LDPS are sent onto the managed network, the managed forwarding elements use these flow entries to identify the logical context of the packet (i.e., the LDPS to which the packet belongs as well as the logical port towards which the packet is headed) and forward the packet according to the logical context.
In some embodiments, a packet leaves its source machine (and the network interface of its source machine) without any sort of logical context ID. Instead, the packet only contains the addresses of the source and destination machine (e.g., MAC addresses, IP addresses, etc.). All of the logical context information is both added and removed at the managed forwarding elements of the network. When a first managed forwarding element receives a packet directly from a source machine, the forwarding element uses information in the packet, as well as the physical port at which it received the packet, to identify the logical context of the packet and append this information to the packet. Similarly, the last managed forwarding element before the destination machine removes the logical context before forwarding the packet to its destination. In addition, the logical context appended to the packet may be modified by intermediate managed forwarding elements along the way in some embodiments. As such, the end machines (and the network interfaces of the end machines) need not be aware of the logical network over which the packet is sent. As a result, the end machines and their network interfaces do not need to be configured to adapt to the logical network. Instead, the network controllers configure only the managed forwarding elements. In addition, because the majority of the forwarding processing is performed at the edge forwarding elements, the overall forwarding resources for the network will scale automatically as more machines are added (because each physical edge forwarding element can only have so many machines attached).
In the logical context appended (e.g., prepended) to the packet, some embodiments only include the logical egress port. That is, the logical context that encapsulates the packet does not include an explicit user ID. Instead, the logical context captures a logical forwarding decision made at the first hop (i.e., a decision as to the destination logical port). From this, the user ID (i.e., the LDPS to which the packet belongs) can be determined implicitly at later forwarding elements by examining the logical egress port (as that logical egress port is part of a particular LDPS). This results in a flat context identifier, meaning that the managed forwarding element does not have to slice the context ID to determine multiple pieces of information within the ID.
In some embodiments, the egress port is a 32-bit ID. However, the use of software forwarding elements for the managed forwarding elements that process the logical contexts in some embodiments enables the system to be modified at any time to change the size of the logical context (e.g., to 64 bits or more), whereas hardware forwarding elements tend to be more constrained to using a particular number of bits for a context identifier. In addition, using a logical context identifier such as described herein results in an explicit separation between logical data (i.e., the egress context ID) and source/destination address data (i.e., MAC addresses). While the source and destination addresses are mapped to the logical ingress and egress ports, the information is stored separately within the packet. Thus, at managed switching elements within a network, packets can be forwarded based entirely on the logical data (i.e., the logical egress information) that encapsulates the packet, without any additional lookup over physical address information.
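As a concrete illustration of the flat context identifier, a 32-bit logical egress port ID could be prepended to the packet as follows; the field layout is an assumption made only for this example:

    # Sketch: encapsulate/decapsulate a packet with a 32-bit logical egress port ID.
    import struct

    def encapsulate(payload: bytes, logical_egress_port: int) -> bytes:
        return struct.pack("!I", logical_egress_port) + payload    # prepend 32-bit ID

    def decapsulate(frame: bytes):
        (egress_port,) = struct.unpack("!I", frame[:4])
        return egress_port, frame[4:]                              # strip before delivery

    frame = encapsulate(b"original packet", 0x0000A001)
    port, payload = decapsulate(frame)
    assert port == 0x0000A001 and payload == b"original packet"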
In some embodiments, the packet processing within a managed forwarding element involves repeatedly sending packets to a dispatch port, effectively resubmitting the packet back into the switching element. In some embodiments, using software switching elements provides the ability to perform such resubmissions of packets. Whereas hardware forwarding elements generally involve a fixed pipeline (due, in part, to the use of an ASIC to perform the processing), software forwarding elements of some embodiments can extend a packet processing pipeline as long as necessary, as there is not much of a delay from performing the resubmissions.
In addition, some embodiments enable optimization of the multiple lookups for subsequent packets within a single set of related packets (e.g., a single TCP/UDP flow). When the first packet arrives, the managed forwarding element performs all of the lookups and resubmits in order to fully process the packet. The forwarding element then caches the end result of the decision (e.g., the addition of an egress context to the packet, and the next-hop forwarding decision out a particular port of the forwarding element over a particular tunnel) along with a unique identifier for the packet that will be shared with all other related packets (i.e., a unique identifier for the TCP/UDP flow). Some embodiments push this cached result into the kernel of the forwarding element for additional optimization. For additional packets that share the unique identifier (i.e., additional packets within the same flow), the forwarding element can use the single cached lookup that specifies all of the actions to perform on the packet. Once the flow of packets is complete (e.g., after a particular amount of time with no packets matching the identifier), in some embodiments the forwarding element flushes the cache. This use of multiple lookups, in some embodiments, involves mapping packets from a physical space (e.g., MAC addresses at physical ports) into a logical space (e.g., a logical forwarding decision to a logical port of a logical switch) and then back into a physical space (e.g., mapping the logical egress context to a physical outport of the switching element).
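The first-packet/subsequent-packet optimization can be sketched with a cache keyed on a per-flow identifier; the key construction and the shape of the cached result are assumptions for illustration:

    # Sketch of caching the end result of the lookups for a TCP/UDP flow.
    flow_cache = {}

    def flow_key(pkt):
        return (pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
                pkt.get("src_port"), pkt.get("dst_port"))

    def process(pkt, full_pipeline):
        key = flow_key(pkt)
        actions = flow_cache.get(key)
        if actions is None:
            actions = full_pipeline(pkt)     # all lookups and resubmits: first packet only
            flow_cache[key] = actions        # subsequent packets reuse the cached result
        return actions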
Such logical networks, that use encapsulation to provide an explicit separation of physical and logical addresses, provide significant advantages over other approaches to network virtualization, such as VLANs. For example, tagging techniques (e.g., VLAN) use a tag placed on the packet to segment forwarding tables to only apply rules associated with the tag to a packet. This only segments an existing address space, rather than introducing a new space. As a result, because the addresses are used for entities in both the virtual and physical realms, they have to be exposed to the physical forwarding tables. As such, the property of aggregation that comes from hierarchical address mapping cannot be exploited. In addition, because no new address space is introduced with tagging, all of the virtual contexts must use identical addressing models and the virtual address space is limited to being the same as the physical address space. A further shortcoming of tagging techniques is the inability to take advantage of mobility through address remapping.
XI. Electronic System
FIG. 64 conceptually illustrates an electronic system 6400 with which some embodiments of the invention are implemented. The electronic system 6400 can be used to execute any of the control, virtualization, or operating system applications described above. The electronic system 6400 may be a computer (e.g., a desktop computer, personal computer, tablet computer, server computer, mainframe, a blade computer, etc.), phone, PDA, or any other sort of electronic device. Such an electronic system includes various types of computer readable media and interfaces for various other types of computer readable media. The electronic system 6400 includes a bus 6405, processing unit(s) 6410, a system memory 6425, a read-only memory 6430, a permanent storage device 6435, input devices 6440, and output devices 6445.
The bus 6405 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 6400. For instance, the bus 6405 communicatively connects the processing unit(s) 6410 with the read-only memory 6430, the system memory 6425, and the permanent storage device 6435.
From these various memory units, the processing unit(s) 6410 retrieve instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments.
The read-only memory (ROM) 6430 stores static data and instructions that are needed by the processing unit(s) 6410 and other modules of the electronic system. The permanent storage device 6435, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the electronic system 6400 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as the permanent storage device 6435.
Other embodiments use a removable storage device (such as a floppy disk, flash drive, etc.) as the permanent storage device. Like the permanent storage device 6435, the system memory 6425 is a read-and-write memory device. However, unlike the storage device 6435, the system memory is a volatile read-and-write memory, such as a random access memory. The system memory stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 6425, the permanent storage device 6435, and/or the read-only memory 6430. From these various memory units, the processing unit(s) 6410 retrieve instructions to execute and data to process in order to execute the processes of some embodiments.
The bus 6405 also connects to the input and output devices 6440 and 6445. The input devices enable the user to communicate information and select commands to the electronic system. The input devices 6440 include alphanumeric keyboards and pointing devices (also called “cursor control devices”). The output devices 6445 display images generated by the electronic system. The output devices include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD). Some embodiments include devices, such as a touchscreen, that function as both input and output devices.
Finally, as shown in FIG. 64, bus 6405 also couples electronic system 6400 to a network 6465 through a network adapter (not shown). In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, such as the Internet. Any or all components of electronic system 6400 may be used in conjunction with the invention.
Some embodiments include electronic components, such as microprocessors, storage and memory that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid state hard drives, read-only and recordable Blu-Ray® discs, ultra density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.
While the above discussion primarily refers to microprocessors or multi-core processors that execute software, some embodiments are performed by one or more integrated circuits, such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself.
As used in this specification, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of this specification, the terms “display” or “displaying” mean displaying on an electronic device. As used in this specification, the terms “computer readable medium,” “computer readable media,” and “machine readable medium” are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral signals.
While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. In addition, a number of the figures (including FIGS. 14, 16, 32, 35, 49, 51, and 52) conceptually illustrate processes. The specific operations of these processes may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process could be implemented using several sub-processes, or as part of a larger macro process. Thus, one of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.