CLAIM OF PRIORITY
The present application claims priority from Japanese Application JP 2007-008220 filed on Jan. 17, 2007, the content of which is hereby incorporated by reference into this application.
BACKGROUND OF THE INVENTION
The present invention relates to a virtual machine system, and to technology for sharing IO (Input/Output) devices among plural virtual servers.
A virtual machine system is widely known in which plural virtual servers are configured on one computer and an operating system (OS) is run individually on each virtual server. To run a large number of virtual servers in a virtual machine system, IO devices must be shared among the virtual servers.
As technology for sharing IO devices among virtual servers, a method of emulating the IO devices by software is known. A method disclosed by U.S. Pat. No. 6,496,847 provides virtual IO devices for an OS on a virtual server. A virtual machine monitor (hereinafter referred to as VMM) receives accesses to virtual IO devices and transfers them to a host OS, which centrally manages accesses to physical IO devices.
As another technology for sharing IO devices among virtual servers, a method of using arbitration hardware that arbitrates accesses to IO devices among virtual servers is known. A method disclosed by Japanese Patent Application Laid-Open Publication No. 2005-122640 monitors writes to memory-mapped IO (MMIO) registers, and performs access to a physical IO device upon a write to a specific register.
BRIEF SUMMARY OF THE INVENTION
However, the related art disclosed in U.S. Pat. No. 6,496,847 does not control IO accesses among virtual servers with priority and QoS (Quality of Service) in mind, so IO bands cannot be allocated according to the priority of the services run on the virtual servers.
Moreover, placing limitations on IO accesses in software components such as the VMM cannot be said to be sufficient for the wider bands of future IO devices, because the performance overhead of IO processing increases.
On the other hand, in the related art described in Japanese Patent Application Laid-Open Publication No. 2005-122640, since plural virtual servers access the arbitration hardware at the same time, arbitration according to the IO priority of the virtual servers cannot be realized.
The present invention has been made in view of the above-described problems, and its object is to provide a virtual machine system that realizes the arbitration of IO accesses and band control based on the priority of virtual servers while curbing performance overhead during IO sharing among the virtual servers.
The present invention is a computer including a central processing unit (hereinafter referred to as CPU), a memory, and an IO interface. The computer is configured to include a hypervisor that generates plural virtual servers, and an IO controller that controls the IO interface, wherein the IO controller includes a DMA receiving unit that receives DMA (Direct Memory Access) requests from the IO interface, a first decoder that decodes a received DMA request and locates a corresponding virtual server, a DMA monitoring counter that monitors a DMA processing status for each of the virtual servers, a threshold register set in advance for each of the virtual servers, and a priority deciding circuit that compares the DMA monitoring counter with the value of the threshold register and decides the priority of processing of the received DMA request.
In a computer including a CPU, a memory, and IO devices, the computer includes a hypervisor that generates plural virtual servers, and the IO devices include a DMA request issuing unit that issues DMA requests, a DMA monitoring counter that monitors a DMA issuance status for each of the virtual servers, a threshold register set in advance for each of the virtual servers, and a priority deciding circuit that compares the DMA monitoring counter with the value of the threshold register and decides the priority of a DMA request to be issued.
In the present invention, the IO controller or the IO devices monitor the DMA processing status or issuance status for each of the virtual servers and compare it with a threshold. Because this construction is independent of software such as the VMM, the arbitration of IO accesses and band control based on the priority of the virtual servers are enabled while curbing performance overhead during IO sharing.
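By way of illustration only, the following C-language sketch models the per-virtual-server DMA monitoring counter, the threshold register, and the priority decision summarized above. It assumes a counter style in which a value exceeding the threshold indicates that the allocated IO band is exceeded; the structure and function names are merely illustrative and the present invention is not limited to this implementation.

#include <stdint.h>

#define MAX_VIRTUAL_SERVERS 8   /* illustrative limit, not taken from the embodiments */

/* Per-virtual-server monitoring state: a DMA monitoring counter and a
 * threshold register set in advance by the hypervisor.                  */
struct vm_dma_monitor {
    uint32_t dma_counter;   /* DMA processing/issuance status of this virtual server */
    uint32_t threshold;     /* threshold register value set in advance              */
};

static struct vm_dma_monitor monitors[MAX_VIRTUAL_SERVERS];

/* Priority deciding function: compares the DMA monitoring counter with the
 * threshold register and returns the processing priority of a received or
 * issued DMA request (0 = high priority, 1 = low priority).               */
static int decide_priority(unsigned vm_number)
{
    const struct vm_dma_monitor *m = &monitors[vm_number];
    return (m->dma_counter > m->threshold) ? 1 : 0;
}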
BRIEF DESCRIPTION OF THE DRAWINGS
These and other features, objects and advantages of the present invention will become more apparent from the following description when taken in conjunction with the accompanying drawings wherein:
FIG. 1 is a block diagram showing an example of a computer configuration that the present invention presupposes;
FIG. 2 is a block diagram showing the chipset structure of the computer shown in FIG. 1;
FIG. 3 is a block diagram showing the structure of main units of a first embodiment of the present invention;
FIG. 4 is a block diagram showing a first example of implementing a DMA flow rate monitoring circuit in a first embodiment;
FIG. 5 is a block diagram showing an example of a Posted/Non-Posted priority deciding circuit in a first embodiment;
FIG. 6 is a block diagram showing a second example of implementing a DMA flow rate monitoring circuit in a first embodiment;
FIG. 7 is a block diagram showing a hypervisor structure in a first embodiment;
FIG. 8 is a flowchart drawing showing the flow of processing in hypervisor operation at notification of DMA flow rate over in a first embodiment;
FIG. 9 is a flowchart showing the flow of processing in DMA flow rate over release operation in a first embodiment;
FIG. 10 is a drawing showing an example of a user interface in a first embodiment;
FIG. 11 is a block diagram showing the structure of a second embodiment of the present invention;
FIG. 12 is a drawing showing a table of correspondences between virtual server numbers and VCs;
FIG. 13 is a block diagram showing the structure of a third embodiment of the present invention;
FIG. 14 is a block diagram showing an embodiment of an arbitrating circuit in a third embodiment; and
FIG. 15 is a block diagram showing the structure of a fourth embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings.
First Embodiment
FIG. 1 shows an example of the structure of a virtual machine system that the embodiments, including the first embodiment, presuppose. The computer mainly comprises hardware components 1001 and software components 1002.
The hardware components 1001 include CPUs 1003a and 1003b as processing units, a memory 1004 as a storing unit, and IO devices 1005, which are mutually connected via a chipset 1006. The chipset 1006 is connected with the CPUs 1003a and 1003b through a CPU bus 1010, with the memory 1004 through a memory interface 1011, and with the IO devices 1005 through an IO interface 1012 and an extended IO slot 1013. The IO devices 1005 are further connected with an HDD (Hard Disk Drive) 1014 or a network 1015.
The chipset 1006 is internally divided into a CPU bus controller 1007, a memory controller 1008, and an IO controller 1009, which respectively control the CPU bus 1010, the memory interface 1011, and the IO interface 1012 connected with the chipset. Although the number of each type of the hardware components 1001 is one or two for convenience of the drawing, the present invention is not limited to these numbers. Although the industry-standard PCI express link is primarily presupposed as the IO interface 1012, the present invention is not limited to it and can also apply to other IO buses and IO ports.
The software components 1002 include a hypervisor 1020 and virtual servers 1021a and 1021b. The hypervisor 1020, which generates and controls the virtual servers 1021a and 1021b, is connected to a management terminal 1024 and receives operations from a server manager 1025. The server manager 1025 directs the generation of virtual servers and the allocation of the hardware components 1001 to the virtual servers. In the virtual server 1021a, one guest OS 1022 and one or more guest applications 1023 operate. Although only two virtual servers 1021a and 1021b are shown for convenience of the drawing, the present invention is not limited to this; three or more virtual servers, or only one, are also permitted.
FIG. 2 shows an internal structure of the chipset 1006 of FIG. 1 and details, in particular, the periphery of the IO controller 1009. The IO controller 1009 is connected with the CPU bus controller 1007 and the memory controller 1008 through an IO to CPU/memory communication interface 1104 and a CPU/memory to IO communication interface 1105, respectively.
The IO controller 1009, which is internally divided into an inbound (receiving side) control subunit 1101 and an outbound (sending side) control subunit 1102, is connected with the IO interface 1012 through an IO interface arbiter 1103.
The inbound control subunit 1101 receives transactions (hereinafter simply referred to as Tx) from the IO interface 1012, and transfers them to the IO to CPU/memory communication interface 1104. In embodiments of the present invention described below, the inbound control subunit 1101 further communicates with the hypervisor 1020 through a hypervisor-oriented communication interface 1106. Plural implementations of the hypervisor-oriented communication interface 1106 are possible, such as an MMIO register, an IO register, an interrupt, a data structure on the memory, and combinations of them. These implementation methods are not detailed here because they are technologies within a scope easily conceivable to hardware designers and hypervisor designers. The outbound control subunit 1102 receives Tx from the CPU/memory to IO communication interface 1105, and transfers it to the IO interface 1012.
FIG. 3 shows the structure of the main units of the first embodiment, and discloses the internal structures of the inbound control subunit 1101 and the software components 1002. The above-described hypervisor-oriented communication interface 1106, which internally includes two interfaces, a flow rate over communication interface 1355 and a register operation interface 1356, connects the inbound control subunit 1101 and the hypervisor 1020. Information about virtual servers in which flow rate over occurs is transmitted over the flow rate over communication interface 1355.
The inbound control subunit 1101 receives Tx from the IO interface arbiter 1103 in a TLP (Transaction Layer Packet) decoder & MUX 1301. The TLP decoder & MUX 1301, which is the DMA receiving unit that receives DMA requests, classifies the Tx received according to the PCI express rules into three types: Posted request 1302, Non-Posted request 1303, and Completion request 1304. The Tx that occupy most of the IO access band to be controlled in this embodiment are conceivably DMA write requests and DMA read requests. Accordingly, a policy of the present invention is to subject the Posted request 1302, which includes DMA write requests, and the Non-Posted request 1303, which includes DMA read requests, to arbitration processing based on the priority of the virtual servers.
The Posted request 1302 is stored in one of HQ (Higher-prioritized Queue) 1307a, LQ (Lower-prioritized Queue) 1308a, and SoQ (Strong-ordered Queue) 1309 via a Posted priority deciding circuit 1305. The queues HQ 1307a, LQ 1308a, and SoQ 1309 have higher processing priority in that order.
On the other hand, the Non-Posted request 1303 is stored in one of HQ 1307b and LQ 1308b via a Non-Posted priority deciding circuit 1306. The queues HQ 1307b and LQ 1308b have higher processing priority in that order.
Each of the priority deciding circuits 1305 and 1306, which function as the priority deciding unit, decides the storage destination of a received request according to the value of the processing priority 1323 generated by the DMA flow rate monitoring circuit 1317. In this embodiment, when the value of the processing priority 1323 is 0, a high priority is assigned to received requests, and when it is 1, a low priority is assigned to them.
The DMA flow rate monitoring circuit 1317 decides the processing priority 1323 according to the Posted request 1302 and the Non-Posted request 1303, a virtual server number (hereinafter referred to as a VM number or VM#) 1322 generated by the VM information decoder 1321, and information set from the hypervisor through the register operation interface 1356.
The VM information decoder 1321 consults the header of the request Tx of the Posted request 1302 and the Non-Posted request 1303, and a value set via the register operation interface 1356, to locate the virtual server corresponding to these requests, and outputs a VM number 1322. Plural methods are conceivable to implement the VM information decoder 1321 that functions as the virtual server locating unit. For example, part of the address bits in the header of a request is regarded as a VM number, or a corresponding VM number is held for each address range and checked at each Tx reception.
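As an illustration of these two decoding methods, the following C sketch locates a VM number either from a fixed field of address bits or from an address-range table. The bit positions, table layout, and function names are assumptions made for this sketch and are not taken from the figures.

#include <stdint.h>
#include <stddef.h>

/* Method 1: regard part of the address bits in the Tx header as the VM number.
 * Here bits [39:36] of the DMA address are assumed to carry the VM number.   */
static unsigned vm_from_address_bits(uint64_t dma_address)
{
    return (unsigned)((dma_address >> 36) & 0xF);
}

/* Method 2: hold a corresponding VM number for each address range and
 * check it at every Tx reception.                                            */
struct addr_range_entry {
    uint64_t base;
    uint64_t limit;
    unsigned vm_number;
};

static int vm_from_address_range(const struct addr_range_entry *table,
                                 size_t entries, uint64_t dma_address)
{
    for (size_t i = 0; i < entries; i++) {
        if (dma_address >= table[i].base && dma_address < table[i].limit)
            return (int)table[i].vm_number;
    }
    return -1;  /* no matching virtual server */
}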
Several methods are conceivable to implement the DMA flow rate monitoring circuit 1317. Two implementation examples are described using FIGS. 4 and 6.
FIG. 4 shows a first example of implementing the DMA flow rate monitoring circuit 1317. In this example, the DMA flow rate monitoring circuit 1317 includes a DMA monitoring counter 1319, and holds credit information 1405a and 1405b, indicating how many requests can subsequently be received internally, for each virtual server. Since each piece of information (latch) of the credit information 1405a and the credit information 1405b, and its peripheral setting circuits, are identical, only the credit information 1405a will be detailed.
The credit information is set from the hypervisor 1020 via credit setting CMD 1402 and a credit setting value 1401 that derive from the register operation interface 1356.
The credit information is decremented at the reception of each Posted request 1302 and Non-Posted request 1303, and is incremented when Tx completion information 1316 indicating completion of processing of each request is asserted. Only one piece of credit information is operated on at each decrement and increment; the piece corresponding to the virtual server located by the VM number 1322 is selectively operated on.
When neither a credit information setting by credit setting CMD 1402 nor the above-described decrement and increment operations is performed, credit information setting SEL 1407 selects the default data, and the previous credit information is kept. The credit information can, in any case, be read from the hypervisor 1020 via the register operation interface 1356.
The DMA flow rate monitoring circuit 1317 holds information on the number of DMAs that can preferentially be processed for each of the virtual servers in threshold register values 1406a and 1406b in the threshold register 1320. The threshold register values 1406a and 1406b are provided for each of the virtual servers, and are set from the hypervisor via threshold setting CMD 1404 and a threshold setting value 1403 that derive from the register operation interface 1356. In the drawing, a threshold value "4" is set in 1406a, and a threshold value "6" is set in 1406b.
The DMA flow rate monitoring circuit 1317 includes a comparator 1318 to compare the credit information and the threshold register value. The credit information and the threshold register value to be compared are specified by the VM number 1322, and credit information selection SEL 1408 and threshold register selection SEL 1409 select the comparison targets.
The comparator 1318 determines that, when the credit information is smaller than the threshold register value, the IO band set for the corresponding virtual server is exceeded. The reason is that the number of DMA processings in progress has increased because more requests than estimated have been received, so the credit has become smaller than expected. In this case, 1 is asserted on the processing priority 1323 and the received request is stored in a queue of lower priority. At the same time as the low priority is selected, the flow rate over communication interface 1355 is asserted to notify the hypervisor 1020 that a virtual server has exceeded its IO band. In the circuit configuration of FIG. 4, although only the assertion of the flow rate over communication interface 1355 is shown in the drawing, as described previously, the virtual server number (VM number 1322) of the corresponding virtual server is transmitted to the hypervisor 1020 via the interface 1355 at the same time as the assertion. The implementation method is not detailed here because it is technology within a range easily conceivable to hardware designers.
On the other hand, when the credit information is equal to or greater than the threshold register value, the comparator 1318 determines that the IO band set for the corresponding virtual server is not exceeded. In this case, 0 is outputted on the processing priority 1323, and the received request is stored in a queue of higher priority.
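A minimal C sketch of this credit-style monitoring, assuming one credit latch and one threshold value per virtual server, is shown below; the function names are illustrative and the circuit itself is, of course, implemented in hardware.

#include <stdint.h>
#include <stdbool.h>

/* Credit-style monitoring of FIG. 4: one entry per virtual server. */
struct vm_credit_monitor {
    int32_t credit;      /* credit information 1405a/1405b        */
    int32_t threshold;   /* threshold register value 1406a/1406b  */
};

/* Returns the processing priority for a received Posted/Non-Posted request:
 * 0 = high priority (band not exceeded), 1 = low priority (band exceeded).
 * *notify_hypervisor models the assertion of the flow rate over
 * communication interface 1355.                                            */
static int on_request_received(struct vm_credit_monitor *m, bool *notify_hypervisor)
{
    m->credit--;                       /* one more DMA processing in progress */
    if (m->credit < m->threshold) {    /* fewer credits left than estimated   */
        *notify_hypervisor = true;     /* report flow rate over, with the VM# */
        return 1;                      /* store request in the lower-priority queue  */
    }
    *notify_hypervisor = false;
    return 0;                          /* store request in the higher-priority queue */
}

static void on_tx_completed(struct vm_credit_monitor *m)
{
    m->credit++;                       /* one DMA processing finished */
}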
FIG. 6 shows a second example of implementing the DMA flow rate monitoring circuit. In this example, the DMA monitoring counter 1319 includes data payload length counters 1604a and 1604b, and holds, for each of the virtual servers, the accumulated payload length of received requests, including DMAs that have already been processed.
The data payload length counters 1604a and 1604b can be reset from the hypervisor 1020 via a reset signal 1601. The hypervisor 1020, the structure of which is described in FIG. 7, periodically resets the counters and monitors the amount of DMA requests received per unit time.
The data payload length counters 1604a and 1604b are counted up at the reception of the requests 1302 and 1303. The value added is the data payload length included in the Tx header of the requests 1302 and 1303, as determined by a decoder 1607. When the request 1302 or 1303 is asserted, add CMD 1603 is asserted corresponding to the VM number 1322, and the data payload length counter 1604a or 1604b is incremented by that value. When the add CMD 1603 is not asserted, the previous information is kept.
In the example of FIG. 6, the accumulated DMA payload length that can be preferentially processed is held in the threshold register values 1406c and 1406d in units of DW (Double Word: 4 bytes). In FIG. 6, 1024 is stored in the threshold register 1406c, and 2,048 is stored in the threshold register 1406d, respectively indicating that DMA requests of up to 1024 DW (4 KB) and 2048 DW (8 KB) can be preferentially processed.
A comparator 1318b determines, when the value of the data payload length counter is greater than the threshold register value, that the IO band set for the corresponding virtual server is exceeded. In other cases, it determines that the IO band is not exceeded. The assertion of the processing priority 1323 and of the flow rate over communication interface 1355 is the same as in the first example shown in FIG. 4.
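A corresponding C sketch of the payload-length variant, assuming per-virtual-server counters in DW units and a reset operation performed by the hypervisor, is as follows; the names are illustrative.

#include <stdint.h>

/* Payload-length variant of FIG. 6: one entry per virtual server. */
struct vm_payload_monitor {
    uint32_t payload_dw;    /* data payload length counter 1604a/1604b */
    uint32_t threshold_dw;  /* threshold register value 1406c/1406d    */
};

/* Called for each received Posted/Non-Posted request.  Returns the
 * processing priority: 1 (low) when the accumulated payload exceeds the
 * threshold, otherwise 0 (high).                                        */
static int on_request_received(struct vm_payload_monitor *m, uint32_t payload_len_dw)
{
    m->payload_dw += payload_len_dw;            /* value taken from the Tx header */
    return (m->payload_dw > m->threshold_dw) ? 1 : 0;
}

/* Periodic reset by the hypervisor (reset signal 1601); when issued every
 * 1 ms, a threshold of 1024 DW corresponds to roughly 4 MB per second of
 * preferentially usable IO band.                                          */
static void on_periodic_reset(struct vm_payload_monitor *m)
{
    m->payload_dw = 0;
}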
The examples of implementing the DMA flow rate monitoring circuit with reference to FIG. 4 and FIG. 6 are as described above. It has been shown that the DMA flow rate, which occupies the major part of the IO band, can be monitored according to threshold values set for each of the virtual servers by using either of the methods described in the examples. Conceivable variants of the DMA flow rate monitoring circuit include increasing the threshold data to output priority at plural levels, determining the DMA flow rate using data credit, and periodically resetting the data payload length counter only within the DMA flow rate monitoring circuit 1317. However, all of these variants are easily inferable to circuit designers and hypervisor designers from the descriptions of the above-described implementation examples, and therefore are not described here.
With reference to FIG. 5, the following describes an example of implementing the Posted/Non-Posted priority deciding circuits 1305 and 1306, which are the priority deciding unit of FIG. 3. The Posted request 1302 is assigned one of three levels of priority according to the type of request and the processing priority 1323, and is enqueued in different queues according to the level. The priority is decided by the Posted priority deciding circuit 1305.
The Posted priority deciding circuit 1305 decodes a received request with an attribute decoder 1501 to determine whether the Strong Ordered attribute is specified. The PCI express protocol states that Posted requests with the Strong Ordered attribute specified must not overtake any preceding Posted requests. Accordingly, when a Posted request with Strong Ordered specified is received, a Strong Ordered signal 1502 is asserted. By this signal, a Posted SoQ enqueue signal 1326 is asserted regardless of the processing priority 1323, and the received Posted request 1302 is enqueued in the SoQ 1309 of the subsequent stage.
When the Strong Ordered signal 1502 is not asserted, an enqueue signal 1324 or 1325 is asserted according to the processing priority 1323. When the processing priority 1323 is 0, that is, when priority is high, the Posted HQ enqueue signal 1324 is asserted, and the received Posted request 1302 is stored in HQ 1307a of the subsequent stage. On the other hand, when the processing priority 1323 is 1, that is, when priority is low, the Posted LQ enqueue signal 1325 is asserted, and the received Posted request 1302 is enqueued in LQ 1308a of the subsequent stage.
On the other hand, the Non-Posted request 1303 is assigned one of two levels of priority according to the processing priority 1323, and is enqueued in different queues according to the level. The priority is decided by the Non-Posted priority deciding circuit 1306. In the Non-Posted priority deciding circuit 1306, an enqueue signal 1327 or 1328 is asserted according to the processing priority 1323. When the processing priority 1323 is 0, that is, when priority is high, the Non-Posted HQ enqueue signal 1327 is asserted, and the received Non-Posted request 1303 is stored in HQ 1307b of the subsequent stage. On the other hand, when the processing priority 1323 is 1, that is, when priority is low, the Non-Posted LQ enqueue signal 1328 is asserted, and the received Non-Posted request 1303 is enqueued in LQ 1308b of the subsequent stage.
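The enqueue decisions of FIG. 5 can be summarized by the following C sketch; the enum values stand in for the enqueue signals 1324 to 1328 and are assumptions of this sketch.

#include <stdbool.h>

enum posted_queue     { POSTED_HQ, POSTED_LQ, POSTED_SOQ };
enum non_posted_queue { NON_POSTED_HQ, NON_POSTED_LQ };

/* Posted requests: three levels.  A request with the Strong Ordered
 * attribute always goes to SoQ, regardless of the processing priority,
 * so that it never overtakes preceding Posted requests.                */
static enum posted_queue decide_posted_queue(bool strong_ordered, int processing_priority)
{
    if (strong_ordered)
        return POSTED_SOQ;
    return (processing_priority == 0) ? POSTED_HQ : POSTED_LQ;
}

/* Non-Posted requests: two levels, decided by the processing priority only. */
static enum non_posted_queue decide_non_posted_queue(int processing_priority)
{
    return (processing_priority == 0) ? NON_POSTED_HQ : NON_POSTED_LQ;
}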
The Posted requests enqueued in the queues 1307a, 1308a, and 1309, and the Non-Posted requests enqueued in the queues 1307b and 1308b, are processed in order via the Posted arbitrating circuit 1310 and the Non-Posted arbitrating circuit 1311, respectively, and are enqueued in PQ 1312 and NPQ 1313.
The Posted arbitrating circuit 1310 preferentially processes HQ 1307a, LQ 1308a, and SoQ 1309 in that order. This priority level is fixed. On the other hand, the Non-Posted arbitrating circuit 1311 preferentially processes HQ 1307b and LQ 1308b in that order, and this priority level is also fixed. Tx stored in the PQ 1312, NPQ 1313, and CQ 1314 is sent to the IO to CPU/memory communication interface 1104 via the arbitrating circuit 1315.
FIG. 14 shows an example of the arbitrating circuit 1315. The arbitrating circuit preferentially sends out Tx from the PQ 1312, CQ 1314, and NPQ 1313 in that order, and complies with PCI express. Note that the combination with the priority levels of the arbitrating circuits 1310 and 1311 described previously does not violate the PCI express ordering rules.
After the completion of Tx issued from the PQ 1312 and NPQ 1313, Tx completion information 1316 is asserted and passed to the DMA flow rate monitoring circuit 1317.
In the first embodiment described above, the processing priority of Posted requests is set to three levels, and the processing priority of Non-Posted requests is set to two levels. However, each of them can be set to any number of levels of two or more. Specifically, conceivable constructions are to share the LQ 1308a and SoQ 1309 for processing at two priority levels to reduce the required circuit scale, or to perform more detailed priority control by dividing the threshold register 1320 into plural sections to output the processing priority 1323 at plural levels. Since either of these constructions is easily inferable to circuit designers, their descriptions are omitted here.
The following describes the internal structure of the software components 1002 at the right side of FIG. 3 relating to the first embodiment. The hypervisor 1020 internally includes a CPU allocation control (unit) 1350, an interrupt notifying unit 1351, and a register setting unit 1353. These functional blocks respectively control the allocation of CPU time to the virtual servers 1021a and 1021b, notify the virtual servers 1021a and 1021b of virtual interrupts, and set and consult the registers of the DMA flow rate monitoring circuit 1317 and the VM information decoder 1321 in a DMA priority control circuit 1330.
FIG. 7 shows an example of the internal functional structure of the hypervisor of the present invention. The hypervisor 1020 internally includes the CPU allocation control 1350, the interrupt notifying unit 1351, and the register setting unit 1353. The CPU allocation control 1350 internally includes a notification reception 1701 and a CPU allocation rate control 1702. The notification reception 1701, when the IO band allocated to a virtual server is exceeded, receives the notification from the flow rate over communication interface 1355 together with information about the corresponding virtual server. The notification reception 1701 then suppresses DMA issuance by the corresponding virtual servers 1021a and 1021b via a CPU allocation rate control request 1703 and a virtual interrupt holding request 1711.
The CPU allocation rate control 1702 controls the CPU time allocated to the virtual servers 1021a and 1021b via the CPU time allocation 1705. The CPU time allocation 1705 assumes execution start and interruption operations of OS code on the virtual servers, such as VMEntry and VMExit in Intel's virtual-server-oriented technology VT-x (Intel® Virtualization Technology Specification for the IA-32 Intel® Architecture). The CPU allocation rate control 1702, when receiving a CPU allocation rate control request 1703, decreases the CPU time allocation rate of the relevant virtual server. For example, a CPU time allocation of 50% assigned to a virtual server at initial setting is decreased to 10%.
The interrupt notifying unit 1351 internally includes an interrupt holding unit 1704 and a periodical interrupt detection 1707. The interrupt holding unit 1704 controls the reporting of virtual interrupts to the virtual servers, and starts the interrupt handler of the guest OS 1022 via a virtual interrupt notification 1706. The virtual interrupt notification 1706 calls the interrupt handler of an OS on a virtual server, such as with the Event Injection function in the VT-x specification described above, and functions in conjunction with the VMEntry execution by the above-mentioned CPU time allocation 1705. The interrupt holding unit 1704, when receiving the virtual interrupt holding request 1711, temporarily holds the notification of virtual interrupts to the relevant virtual server.
The CPU time allocation 1705 and the virtual interrupt notification 1706 are collectively referred to as a DMA suppression interface 1354. By operating this interface, the CPU time allocated to a virtual server is decreased and its virtual interrupt reports are held, so that the DMA requests issued by the guest OS 1022 are suppressed. This processing decreases the IO band used by the relevant virtual server and eliminates the excess over the predetermined threshold values.
The periodical interrupt detection 1707 is a module that is periodically started and issues a DMA status reevaluation request 1714 and a data payload counter reset request 1715. There are plural methods for realizing the periodical start: using a timer interrupt from hardware such as a PIT (Programmable Interval Timer) or an RTC (Real Time Clock) as a trigger, or polling a time stamp counter (TSC) within the CPU. Any of them may be adopted.
The following describes a structure for releasing the suppression of DMA requests in the hypervisor 1020 of FIG. 7. The register setting unit 1353 internally includes a DMA status evaluation 1709, a data payload counter reset 1710, and a register initialization 1708.
The DMA status evaluation 1709, on receiving a DMA status reevaluation request 1714 periodically issued from the periodical interrupt detection 1707, evaluates the IO bands used by the virtual servers, and issues a request to release the DMA suppression of any virtual server in which the excess of IO band used has been eliminated. To evaluate the IO bands, the DMA status evaluation 1709 reads the DMA monitoring counter 1319 via the register operation interface 1356 and compares it with threshold setting information 1716 described later. The comparison is performed in the same way as by the comparators 1318 and 1318b described previously.
The DMA status evaluation 1709, to release the DMA suppression, issues a CPU allocation rate recovery request 1712 and a virtual interrupt holding release request 1713. On receiving these requests, the CPU allocation rate control 1702 and the interrupt holding unit 1704 respectively recover the allocation rate of the decreased CPU time and restart the notification of held virtual interrupts. By this processing, DMA requests by the guest OS 1022 can be restarted.
The data payload counter reset 1710 is used when the second example of the DMA flow rate monitoring circuit described in FIG. 6 is adopted. Triggered by a data payload counter reset request 1715 periodically issued from the periodical interrupt detection 1707, the data payload counter reset 1710 resets the data payload length counters 1604a and 1604b described previously. By this processing, when the data payload counter reset request 1715 is issued, for example, every 1 ms, the threshold register values 1406c and 1406d described in FIG. 6 permit preferential use of IO bands of 4 MB per second and 8 MB per second, respectively.
The register initialization unit 1708 is a module for initializing the above-described DMA monitoring counter 1319 and threshold register 1320, and internally includes the threshold setting information 1716. The threshold setting information 1716 is specified by the manager through the management terminal 1024. An example of a user interface displayed on the management terminal 1024 is described later using FIG. 10.
With reference to the flowchart of FIG. 8, the operation of the hypervisor 1020 at DMA flow rate over notification in the first embodiment is described.
Step 1801 receives notification via the flow rate over communication interface 1355. As described above, the notification includes the virtual server information (VM number) for which a low priority was selected, and based on this information all virtual servers in which DMA flow rate over occurs can be located in Step 1802.
Step 1803 requests the CPU allocation rate control 1702 to suppress the CPU allocation rate of the relevant virtual servers.
In Step 1804, the CPU allocation rate control 1702 decreases the CPU allocation rate of the requested virtual servers.
Step 1805 requests the interrupt holding unit 1704 to temporarily hold interrupt notification to the relevant virtual servers.
In Step 1806, the interrupt holding unit 1704 temporarily holds interrupt notification to the requested virtual servers.
Step 1807 confirms whether all virtual servers in which DMA flow rate over occurs have completed the CPU allocation rate suppression and interrupt holding processing shown in Steps 1803 to 1806. If not, the processings of Steps 1803 to 1806 are performed again for the remaining virtual servers. When the processings have been completed, the processing shown in this flowchart is terminated.
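The processing of FIG. 8 may be sketched in C as follows; the structure, the stub functions, and their names are assumptions standing in for the CPU allocation rate control 1702 and the interrupt holding unit 1704.

#include <stdbool.h>
#include <stddef.h>

/* One entry per virtual server; flow_rate_over is set from the VM numbers
 * carried by the flow rate over notification (Steps 1801-1802).           */
struct vs_state {
    int  vm_number;
    bool flow_rate_over;
};

/* Stub for the CPU allocation rate control 1702 (Steps 1803-1804). */
static void decrease_cpu_allocation_rate(int vm_number)
{
    (void)vm_number;   /* e.g. lower the allocation from 50% to 10% */
}

/* Stub for the interrupt holding unit 1704 (Steps 1805-1806). */
static void hold_virtual_interrupts(int vm_number)
{
    (void)vm_number;   /* temporarily hold virtual interrupt notification */
}

/* Steps 1802-1807: suppress DMA issuance for every virtual server in which
 * DMA flow rate over occurs.                                               */
static void on_flow_rate_over(struct vs_state *servers, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!servers[i].flow_rate_over)
            continue;
        decrease_cpu_allocation_rate(servers[i].vm_number);
        hold_virtual_interrupts(servers[i].vm_number);
    }
}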
With reference to the flowchart of FIG. 9, the following describes the operation of the hypervisor 1020 at the release of DMA flow rate over in the first embodiment.
Step 1901 starts the periodical interrupt detection 1707.
Step 1902 reads the current value of the DMA monitoring counter 1319 via the register operation interface 1356.
Step 1903 compares the threshold setting information 1716 and the current value of the DMA monitoring counter 1319 for one virtual server.
Step 1904 branches the processing according to the comparison result of Step 1903. That is, when the excess of IO band used by the virtual server is not released, control branches to Step 1909; when it is released, control goes to Step 1905.
Steps 1905 to 1908 perform the release of DMA suppression for the server.
Step 1905 requests the CPU allocation rate control 1702 to recover the CPU allocation rate.
In Step 1906, the CPU allocation rate control 1702 that has received the request recovers the CPU allocation rate of the virtual server. For example, a virtual server with its CPU allocation rate suppressed to 10% is recovered to the initial setting value, e.g., 50%.
Step 1907 requests the interrupt holding unit 1704 to release the holding of virtual interrupts.
In Step 1908, the interrupt holding unit 1704 that has received the request restarts virtual interrupt notification to the virtual server.
Step 1909 determines whether the processings in Steps 1903 to 1908 have been completed for all the virtual servers. When not completed, the processings in Steps 1903 to 1908 are performed again for the remaining virtual servers. When completed, the processing shown in the flowchart is terminated.
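Likewise, the release operation of FIG. 9 may be sketched as follows; reading the counter, the comparison, and the recovery callbacks are illustrative stubs under the same assumptions as the previous sketch.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vs_band_state {
    int      vm_number;
    uint32_t dma_counter;   /* current value read via the register operation interface 1356 */
    uint32_t threshold;     /* threshold setting information 1716 */
    bool     suppressed;    /* DMA issuance currently suppressed for this server */
};

static void recover_cpu_allocation_rate(int vm_number)
{
    (void)vm_number;   /* e.g. raise the allocation from 10% back to 50% (Steps 1905-1906) */
}

static void release_held_interrupts(int vm_number)
{
    (void)vm_number;   /* restart virtual interrupt notification (Steps 1907-1908) */
}

/* Invoked from the periodical interrupt detection 1707 (Steps 1901-1909). */
static void on_periodic_reevaluation(struct vs_band_state *servers, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!servers[i].suppressed)
            continue;
        /* Steps 1903-1904: same comparison as performed by the comparators
         * 1318 and 1318b in the IO controller.                             */
        if (servers[i].dma_counter <= servers[i].threshold) {
            recover_cpu_allocation_rate(servers[i].vm_number);
            release_held_interrupts(servers[i].vm_number);
            servers[i].suppressed = false;
        }
    }
}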
FIG. 10 shows an example of a user interface displayed on the management terminal 1024 shown in FIG. 1. The user interface shown in this drawing is intended as a GUI (Graphical User Interface) using a CRT (Cathode Ray Tube), a WWW (World Wide Web) browser, or the like, and assumes operations using a mouse pointer 2050. However, it goes without saying that an interface having the same setting items, even a CLI (Command Line Interface), can be installed.
In FIG. 10, 2001 is a resource allocation setting window that directs the allocation of computer resources to virtual servers. This window includes a resource allocation setting window operation bar 2005, a CPU allocation setting tab 2002, a memory allocation setting tab 2003, and an IO allocation setting tab 2004. The operation bar 2005 is used to direct the iconifying and closing of the resource allocation setting window 2001 itself. The tab 2002 directs the allocation of CPU resources to virtual servers, the tab 2003 directs the allocation of memory resources, and the tab 2004 directs the allocation of IO resources. In the drawing, the IO allocation setting tab 2004 is selected.
Hereinafter, the IO allocation setting tab 2004, which deals with the IO resource allocation most related to this embodiment, will be detailed. The IO allocation setting tab 2004 includes an IO sharing setting window 2007 and an IO device occupation setting window operation bar 2040. The IO sharing setting window 2007 is a window for setting, for each physical IO device, the virtual servers that use it in common. The IO device occupation setting window operation bar 2040 indicates a state in which the window is iconified; its detailed description is omitted.
The IO sharing setting window 2007 includes an IO sharing setting window operation bar 2006, a physical IO device sharing status confirmation window 2010, and a NIC#0 sharing setting window 2020. The physical IO device sharing status confirmation window 2010 displays a list of ready physical IO devices. FIG. 10 shows that SCSI HBA#0 2011, FC HBA#0 2012, NIC#0 2013, and USB 2014 are ready for use. FC HBA#0 2012 is enclosed by a dotted line to indicate that it is being exclusively used by some virtual server; it cannot be operated within the window. In FIG. 10, NIC#0 2013 within the window 2010 is displayed with a shaded list frame to indicate that it has been selected, and detailed setting of NIC#0 2013 can be performed in the NIC#0 sharing setting window 2020.
The NIC#0 sharing setting window 2020 includes an IO shared information title line 2030, a NIC#0 sharing setting 2031 for virtual server #0, a NIC#0 sharing setting 2032 for virtual server #1, a change approval button 2021, and a change cancel button 2022. The manager changes elements within the NIC#0 sharing setting 2031 for virtual server #0 and the NIC#0 sharing setting 2032 for virtual server #1, then clicks the change approval button 2021, and thereby can change the sharing settings among virtual servers. If the changes are incorrect, the change contents can be canceled by clicking the change cancel button 2022.
In the NIC#0 sharing setting 2031 for virtual server #0 and the NIC#0 sharing setting 2032 for virtual server #1, the corresponding virtual server numbers (VM#) are displayed so that sharing on/off and the DMA threshold can be set for each. The sharing field is a pull-down menu for setting whether the virtual server shares the relevant device (NIC#0 2013 in the example of FIG. 10).
The DMA threshold field allows the user to set the preferentially usable IO band, in the form of a DMA threshold, for use of the relevant device. The example of FIG. 10 shows setting values when the data payload length counters 1604a and 1604b shown in FIG. 6 are used; 1024 is set for the counter 1604a corresponding to VM#0, and 2,048 is set for the counter 1604b corresponding to VM#1. In the setting field, the setting value can be increased or decreased by selecting the pair of up-facing and down-facing rectangular buttons with the mouse pointer 2050.
Although a DMA threshold value is directly set by the user in the user interface example of FIG. 10, other, more readable indexes may be used as an alternative method. For example, a preferentially usable IO band may be set directly as a number of MB per second (MB/s). In this case, the threshold value finally set must be derived without contradiction. However, this is control easily inferable from the descriptions of the above embodiments, and its detailed description is omitted.
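As a sketch of such an alternative, the following C function converts a band specified in MB per second into a DW threshold consistent with the period at which the data payload length counter is reset; the function name and the use of a 1 ms reset period in the example are assumptions of this sketch.

#include <stdint.h>

/* Convert a preferentially usable IO band given in MB/s into a threshold in
 * DW (4-byte) units for one reset period of the data payload length counter. */
static uint32_t mb_per_s_to_dw_threshold(uint32_t mb_per_s, uint32_t reset_period_us)
{
    /* bytes permitted per reset period (1 MB/s = 1,000,000 bytes/s) */
    uint64_t bytes_per_period = (uint64_t)mb_per_s * 1000000u * reset_period_us / 1000000u;
    return (uint32_t)(bytes_per_period / 4u);   /* 4 bytes per DW */
}

/* Example: 4 MB/s with a 1000 us (1 ms) reset period gives 4000 bytes per
 * period, i.e. 1000 DW, which is of the same order as the 1024 DW threshold
 * shown in FIG. 6 and FIG. 10.                                              */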
Second Embodiment
The following describes a second embodiment. This embodiment achieves a capping function that prohibits DMA requests from consuming more IO band than specified, by allocating virtual channels (VC) to virtual servers.
FIG. 11 shows the structure of the main units of the second embodiment of the present invention, that is, the internal structure of the inbound control subunit 1101 and the software components 1002. The inbound control subunit 1101 allocates one VC to each virtual server to make the control structure of Tx processing independent for each of the virtual servers. Therefore, the subunit 1101 internally includes VC-specific inbound TLP processing circuits 2110b and 2110c, as well as a VC MUX 2103, an inter-VC arbitrating circuit 2102, and a VM information decoder 2105.
The VM information decoder 2105 receives the TC# (Traffic Class) 2104 from the header of a Tx received from the IO interface arbiter 1103, and then locates VC# 2107 according to the information in the VM#-VC# correspondence table 2106. The VM#-VC# correspondence table 2106 is set from the hypervisor 1020 via the VM#-VC# correspondence table setting interface 2108. The VM#-VC# correspondence table setting interface 2108 may be shared with the register operation interfaces 1356b and 1356c.
FIG. 12 shows an example of the data structure of the VM#-VC# correspondence table 2106. In this embodiment, the virtual server number (VM#) and the TC# are stored so as to coincide in one column. In this data structure, a Tx with TC# (=VM#) = 0 is associated with VC#0 by the information in the line of VC# corresponding to virtual server #0 2200, and a Tx with TC# (=VM#) = 1 is associated with VC#1 by the information in the line of VC# corresponding to virtual server #1 2201.
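A C sketch of this lookup, assuming two virtual channels and that the TC# in the received header coincides with the VM# as described above, is shown below; the array and function names are illustrative.

#include <stdint.h>

#define NUM_VCS 2   /* two virtual channels in this illustration */

/* VM#-VC# correspondence table 2106 of FIG. 12: one VC# per virtual server. */
static const uint8_t vm_to_vc[NUM_VCS] = {
    0,   /* virtual server #0 -> VC#0 (line 2200) */
    1,   /* virtual server #1 -> VC#1 (line 2201) */
};

/* The VC MUX 2103 would use the looked-up VC# to assert the VC-specific TLP
 * reception interface 2109b or 2109c.                                       */
static unsigned vc_for_tc(unsigned tc)
{
    return (tc < NUM_VCS) ? vm_to_vc[tc] : 0u;   /* default to VC#0 (assumption) */
}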
The VC MUX 2103 decides the transfer destination of a received Tx according to the value of VC# 2107. Specifically, when VC# 2107 = 0, the VC-specific TLP reception interface 2109b is asserted, and when VC# 2107 = 1, the VC-specific TLP reception interface 2109c is asserted.
The VC-specific inbound TLP processing circuits 2110b and 2110c each include a PCI express TLP processing queue structure 2111 and a DMA capping control circuit 2101 that primarily functions as a DMA capping unit. The PCI express TLP processing queue structure 2111 performs priority control according to the PCI express rules. Since the internal components of the PCI express TLP processing queue 2111 have already been described, their description is omitted here.
The DMA capping control circuit 2101 decides whether to permit the issuance of a Tx outputted from the PCI express TLP processing queue structure 2111 according to the processing priority 1323 outputted from a DMA flow rate monitoring circuit 1317b. Specifically, when the processing priority 1323 is 1 (low priority), it suppresses the issuance of the Tx, and when it is 0 (high priority), it permits the issuance of the Tx. By this processing, as long as the excess of the IO band set for the virtual server is not eliminated, a new DMA request cannot be issued, and the capping function is thus implemented. The structure of the DMA flow rate monitoring circuit 1317b conforms to the structures shown in the first and second implementation examples of the DMA flow rate monitoring circuit in FIGS. 4 and 6, and its detailed description is omitted here.
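The capping decision itself reduces to the following C sketch, in which the function name is illustrative and the processing priority 1323 is supplied by the DMA flow rate monitoring circuit 1317b.

#include <stdbool.h>

/* Gate applied by the DMA capping control circuit 2101 to Tx output from the
 * PCI express TLP processing queue structure 2111.                           */
static bool may_issue_tx(int processing_priority)
{
    /* 1 (low priority): the IO band set for the virtual server is exceeded,
     * so issuance of a new Tx is suppressed until the excess is eliminated.
     * 0 (high priority): issuance is permitted.                              */
    return processing_priority == 0;
}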
The inter-VC arbitrating circuit 2102 arbitrates the Tx issued from the VC-specific inbound TLP processing circuits 2110b and 2110c, and sends it to the IO to CPU/memory communication interface 1104. This arbitrating circuit provides no processing priority between VCs, and performs fair arbitration such as round robin. Therefore, even if the issuance of new DMA is suppressed in an arbitrary virtual server because its IO band is exceeded, the DMA of the other virtual servers is not interfered with.
Third Embodiment
The following describes a third embodiment. In this embodiment, virtual-server-specific IO band control is performed not in the IO controller but in the IO devices.
FIG. 13 shows an internal structure of an IO device 1005d in this embodiment. FIG. 13 assumes a NIC (Network Interface Card), which is connected to the outside through the IO interface 1012 and the network 1015.
The IO device 1005d includes an arbitrating circuit 1315d for PQ 1312d, NPQ 1313d, and CQ 1314d, which transmit Tx to the IO interface 1012, and includes PQ 1312e, NPQ 1313e, and CQ 1314e, which receive Tx from the IO interface 1012. It also includes an N/W packet transmission 2304 that transmits packets to the network 1015, and an N/W packet reception 2303 that receives packets.
In this embodiment, a DMA capping control circuit 2101d is provided as the component that controls IO bands. The DMA capping control circuit 2101d internally includes a DMA flow rate monitoring circuit 1317d and an AND element that decides whether to permit a request to a DMA Read issuance 2307. The DMA flow rate monitoring circuit 1317d conforms to the DMA flow rate monitoring circuits 1317 and 1317b in the first and second embodiments, and its detailed description is omitted here.
The DMA capping control circuit 2101d decides, by referring to the processing priority 1323, whether to permit the issuance of a Non-Posted request 1303d directed for issuance by the sequencer 2302. The processing priority 1323 is asserted by the DMA flow rate monitoring circuit 1317d in the control circuit 2101d; when it is 1 (low priority), a request to the DMA Read issuance 2307 is suppressed, and when it is 0 (high priority), the issuance of a request to the DMA Read issuance 2307 is permitted. By this circuit, when the IO band set for a virtual server is exceeded, the issuance of DMA Read requests is prohibited, so that capping can be realized.
Since processing cannot be discontinued for N/W packet storage requests 2312 issued from the N/W packet reception 2303, capping by the processing priority 1323 is not applied to them.
When detecting the excess of an IO band, the DMA flow rate monitoring circuit 1317d asserts a flow rate over interrupt generation request 2308. This request is converted into an interrupt Tx by the interrupt generation 2305, and finally passed to the hypervisor 1020. Processing in the hypervisor is the same as that at reception of a request from the above-described flow rate over communication interface 1355.
The sequencer 2302, which controls the entire IO device 1005d, receives requests from the register access control unit 2301, Tx completion information 1316d, and DMA Read Completion 2310, and performs the issuance of the Non-Posted request 1303d and the assertion of a sequencer interrupt request 2309.
The Non-Posted request 1303d is chiefly asserted upon packet transmission from the guest OS 1022: a DMA read request is sent to the IO interface 1012, the DMA Read Completion 2310 is received, and finally the N/W packet transmission 2304 operates.
The sequencer interrupt generation request 2309 is a signal asserted upon the completion of a request from the guest OS 1022 and upon data reception from the N/W packet reception 2303. When this signal is asserted, it is converted into a Tx by the interrupt generation 2305, and finally notified to the guest OS 1022. The guest OS 1022 recognizes the interrupt from the IO device 1005d, withdraws DMA buffers, and transfers communication data to the guest application 1023.
The register access control unit 2301 is activated by register access requests 2311a and 2311b. Since Posted requests and Non-Posted requests to the IO device 1005d are chiefly read/write operations on the registers of the IO device 1005d, the DMA flow rate monitoring circuit 1317d and the sequencer 2302 are activated according to the register to be accessed. By this processing, reference to a DMA monitoring counter 1319d from the hypervisor 1020 and the setting of a threshold register 1320d are realized. For register read requests, the return data is returned to the CPU via CQ 1314d.
Fourth Embodiment
The following describes a fourth embodiment. This embodiment assumes that a proprietary interface other than PCI express is used as the IO interface.
FIG. 15 shows an example of an internal structure of an inbound control subunit 1005e. This embodiment differs from the first embodiment of the present invention shown in FIG. 3 in that the Txes received in the inbound control subunit 1005e are divided into only two systems, a request system Tx and a response system Tx, for processing. The foregoing assumes that Txes requesting processing of DMA writes and DMA reads are contained in the request system Tx, and that Txes for the end report of DMA writes and for DMA read reply data are contained in the response system Tx.
The Tx reception & MUX 2401 separates a received Tx into a request system Tx 2402 and a response system Tx 2403. The request system Tx 2402 is stored in a request system Tx queue 2406 via a DMA priority control circuit 1301e. The response system Tx 2403 is stored in a response system Tx queue 2407.
The DMA priority control circuit 1301e, like the DMA priority control circuit 1330 of the first embodiment, internally includes a DMA flow rate monitoring circuit 1317e and a VM information decoder 1321e. Since the DMA flow rate monitoring circuit 1317e and the VM information decoder 1321e are the same as those of the first embodiment shown in FIG. 3, their detailed descriptions are omitted.
A request system priority setting circuit decides the enqueue destination according to the processing priority 1323. That is, when the processing priority 1323 is 0 (high priority), the received Tx is enqueued in HQ 1307a, and when the processing priority 1323 is 1 (low priority), it is enqueued in LQ 1308a. A request system Tx arbitrating circuit 2405 preferentially fetches Tx from the HQ 1307a and enqueues it in the request system Tx queue 2406. The arbitration rules of the request system Tx arbitrating circuit 2405 are fixed.
The Tx arbitrating circuit 2408 arbitrates the Txes stored in the request system Tx queue 2406 and the response system Tx queue 2407, and sends them out to the IO to CPU/memory communication interface 1104. The Tx arbitrating circuit 2408 always preferentially sends out Txes from the response system Tx queue 2407.
It has been demonstrated from the foregoing that the present invention can apply to proprietary interfaces other than PCI express as well.
As has been described above, since the present invention enables the arbitration of IO accesses and band control based on the priority of virtual servers while curbing performance overhead during IO sharing among the virtual servers, finite IO resources can be appropriately allocated even in an information system in which a large number of virtual servers are required to operate.