VIRTUAL ADDRESSING FOR SUBSYSTEM DMA
FIELD OF THE INVENTION
The present invention relates generally to computer addressing systems and more particularly to an improved signal processing method for providing direct memory access for computer subsystems.
BACKGROUND OF THE INVENTION
The use and application of computer graphics in all kinds of systems and subsystem environments continues to increase with the availability of ever faster information processing and retrieval devices. The relatively higher speed of operation of such devices remains a high-priority design objective. This is especially true in a graphics system, and even more so in "3D" graphics systems. Such graphics systems require a great deal of processing for huge amounts of data, and the speed of data flow is critical in providing a marketable new product or system, or in designing graphics or other subsystems which may enable and drive new computer applications.
In many computer systems, the operating system (OS) generally utilizes the central processing unit (CPU) in a hardware-assisted virtual memory mode. The hardware allows the software to treat memory as a large (larger than the available physical memory), virtualized (at least one level of indirection to the physical memory address) object. This is despite the fact that memory is allocated and de-allocated in 4K byte granularity, and that consecutive 4K blocks in virtual memory have no relation to each other in terms of physical address. If all applications and operating systems could always obtain assignment of all of the physically contiguous memory that the OS or application requested, there would be no need for virtual memory. However, this is not the case in modern applications, and substantially all computer systems require virtual memory and virtual memory management.
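The level of indirection described above can be modeled with a minimal sketch; the page-table contents and the `translate` helper here are illustrative assumptions only, not part of the claimed apparatus:

```python
PAGE_SIZE = 4096  # 4K byte allocation granularity

# Toy page table: virtual page number -> physical page frame.
# Consecutive virtual pages map to unrelated physical frames.
page_table = {0: 7, 1: 2, 2: 42}

def translate(virtual_addr):
    """Split a virtual address into a page number and an offset,
    then substitute the physical frame found in the page table."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in page_table:
        raise KeyError("page fault: virtual page not resident")
    return page_table[vpn] * PAGE_SIZE + offset
```

Virtual addresses in pages 0 and 1 are adjacent, yet resolve to frames 7 and 2, which need not be adjacent in physical memory.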
Virtual memory management software is not bound to keep the virtual translation scheme stationary over time. That is, virtual-to-physical address mapping can and does change over time as virtual addresses are swapped to disk and their corresponding physical memory locations are freed to be used by other virtual address regions. The memory management is usually done entirely within the purview of the operating system; software applications, peripherals, and driver software are not involved in, nor informed of, these virtual address changes. The only interaction is the ability to request that certain regions of memory be locked in for possible future use. Operating system suppliers usually recommend that memory not stay locked or dedicated indefinitely, partially because such memory dedication would cede too much control to device drivers and also degrade system performance. The system degradation is especially problematical in systems with only minimum memory installed.
Contrary to the CPU's ability to virtualize memory, the PCI bus in a computer system deals only in physical memory addresses. The PCI addresses correspond directly to the physical address decode in the PC core logic. Therefore, there is a need for translation from virtual memory, as used by the CPU and its software, to physical memory, as needed by the physically addressed devices and memory on the PCI bus.
In all data and information processing systems, and especially in computer graphics systems, much time is consumed in accessing data from a memory or storage location, then processing that information and sending the processed information to another location for subsequent access, processing and/or display. As the speed of new processors continues to increase, access time for accessing and retrieving data from memory is becoming more and more of a bottleneck relative to available system speed. Subsystems such as graphics systems must be capable of performing more sophisticated functions in less time in order to process greater amounts of graphical data required by modern software applications. Thus, there is a continuing need for improvements in software methods and hardware implementations to accommodate operational speeds required by an expanding array of highly desired graphics applications and related special video effects.
SUMMARY OF THE INVENTION
A method and system are provided for implementing an information access and memory process by which memory page tables are assigned a stationary physical address in memory, and accessed directly at the assigned address by a graphics processor which effectively by-passes the system CPU and the CPU-related page table address translation iterations.
BRIEF DESCRIPTION OF THE DRAWINGS
A better understanding of the present invention can be obtained when the following detailed description of a preferred embodiment is considered in conjunction with the following drawings, in which:
Figure 1 is a block diagram of a computer system including a graphics subsystem;
Figure 2 is a block diagram of the graphics device shown in Figure 1;
Figure 3 is a process diagram illustrating a typical transaction in obtaining real address data from a virtual address request;
Figure 4 is an illustration of the addressing method implemented in the present example;
Figure 5 is a block diagram of several components of the graphics processor device shown in Figure 1;
Figure 6 is a schematic diagram illustrating the address translation unit shown in Figure 5;
Figure 7 is a flowchart showing the steps of the method implemented in the present example; and
Figure 8 is a flowchart showing the internal flow of the cache fill process shown in Figure 7.
DETAILED DESCRIPTION
With reference to Figure 1, the various methods discussed above may be implemented within a typical computer system or workstation 101. An exemplary hardware configuration of a workstation which may be used in conjunction with the present invention is illustrated and includes a central processing unit (CPU) 103, such as a conventional microprocessor, and a number of other units interconnected through a system bus 105 such as a so-called "PCI" bus. The bus 105 may include an extension 121 for further connections to other workstations or networks, other peripherals and the like. The workstation shown in Figure 1 includes system random access memory (RAM) 109 and a memory controller 107. The system bus 105 is also typically connected through a user interface adapter 115 to a keyboard device 111 and a mouse or other pointing device 113. Other user interface devices may also be coupled to the system bus 105 through the user interface adapter 115. A graphics device 117 is also shown connected between the system bus 105 and a monitor or display device 119. Since the workstation or computer system 101 within which the present invention is implemented is, for the most part, generally known in the art and composed of electronic components and circuits which are also generally known to those skilled in the art, circuit details beyond those shown in Figure 1 will not be explained to any greater extent than considered necessary for the understanding and appreciation of the underlying concepts of the present invention, and in order not to obfuscate or distract from its teachings.
In Figure 2, the system bus 105 is shown connected to the graphics device 117. The graphics device is representative of many subsystems which may be implemented to take advantage of the benefits available from an implementation of the present invention. The exemplary graphics device 117 includes a graphics processor 201 which is arranged to process, transmit and receive information or data from a graphics memory unit 203. The graphics memory 203 may include, for example, a frame buffer unit for storing frame display information which is accessed by the graphics processor 201 and sent to the display device 119. The display device 119 is operable to provide a graphics display of the information stored in the frame buffer as processed by the operation of the graphics processor 201.
In Figure 3, there is shown an example of a typical system memory fetching operation after an address request is generated by a subsystem such as a graphics unit 301. The graphics unit 301 is connected to a system memory controller 303 which, in turn, is connected to system memory 305. System memory 305 in the illustration provides information back to the graphics unit 301. When an address request R1 is generated by the graphics unit 301, the system memory controller 303 processes that request and, in conjunction with a system operating system, addresses a page table portion of system memory 305. As applications are run on the computer system, the contents of the physical addresses of the system memory 305 are moved throughout the system memory and are present at different locations at different times depending upon the applications being run on the computer. The physical addresses and address content are kept track of by the operating system and maintained in a page table portion of the system memory 305.
As address requests are received by the system memory controller 303, the page table in system memory 305 is referred to in order to determine the address of the requested information at that particular time. As shown in Figure 3, the system then decodes the address and releases the data DT1 to the system bus to be picked up by the subsystem or graphics unit 301 which requested the data. The data is typically requested by reference to a base address, and a corresponding section beginning with that base address is returned to satisfy the request. In the event that sequential sections of the data stored at the base address have been separated and reside at non-sequential addresses when a request for that data is received, the graphics unit 301 will generate another request R1A to the memory controller 303 to access the memory 305 and the page table to locate the next segment of requested information and send the data DT1A back to the requesting unit 301. That process generally continues and the requesting unit 301 may generate additional requests R1B and, in response, receive information DT1B from memory 305 until all of the requested information has been received. As can be seen, the operation of the memory controller 303 is required in all of the data fetches, and the operating system limits the amount of data accessed at one time, so that additional accesses are often required and the memory controller 303 is engaged at each request for data. As hereinafter illustrated, the present disclosure shows a method and apparatus through which such data accesses may be accomplished much faster and with less memory bandwidth usage, thereby freeing up the memory bandwidth to process other tasks.
In Figure 4, a system memory 400 is shown and includes a page table 401 and, in the present example, three locked pages 405, 407 and 409 at different addresses in system memory 400. In accordance with the present invention, the page table 401 from the main system memory 400 is copied and read into a locked copy location or dedicated base address 403 in system memory 400. In the preferred embodiment, the locked page table copy location 403 is located in the main system memory 400, but the copied page table may also be read into a dedicated base address in another memory system or subsystem. As can be seen in Figure 4, various page segments 405, 407 and 409 within the system memory are located through use of the copied page table 403 and accessed as sequential segments by the graphics processor engine 501 shown in Figure 5. That process allows single-access readout of a larger block of memory, including sequential segments, without requiring additional processing by the system memory controller 303.
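The locked-copy scheme can be sketched as follows; the table contents and the `read_block` helper are illustrative assumptions, not the claimed apparatus:

```python
# Virtual page -> physical frame, as maintained by the operating system.
os_page_table = {0: 5, 1: 9, 2: 3}

# The page table is copied into a locked (stationary) region of system
# memory; a plain snapshot stands in for that dedicated base address here.
locked_table = dict(os_page_table)

def read_block(start_vpn, n_pages):
    """Gather the physical frames backing n_pages consecutive virtual
    pages by walking the locked copy directly, without a memory-controller
    translation round trip per page."""
    return [locked_table[start_vpn + i] for i in range(n_pages)]
```

A request spanning pages 0 through 2 thus resolves to frames 5, 9 and 3 in a single pass over the locked table.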
Figure 5 illustrates various components within the graphics device 117 shown in Figure 1. A graphics processor engine 501 generates an address request for a virtual address which is sent to an address translation unit 503. The address translation unit 503 translates the requested virtual address into a physical address which may be recognized by the PCI bus 105 and the system CPU 103. The address request in a physical address format is sent to a PCI Interface unit 507 and applied to the PCI bus 105.
The address translation unit 503 is shown in detail in Figure 6. In the exemplary embodiment, there are three separate system or host memory apertures, which are designated HXY0 (Host XY aperture "0"), HXY1 (Host XY aperture "1") and PF (Prefetch Unit). Each of the three memory apertures has an associated req_base, page table and cache designation. Each aperture represents a different view of the system and contains various aspects of the subsystem. In a graphics application, for example, the HXY0 aperture may contain the Host XY, color and "Z" dimension information. Similarly, the HXY1 aperture may contain linear transfer and host texture information, and the PF aperture may contain so-called "display list" information. In the exemplary embodiment, the requested address contains various bit segments which carry information about the requested address. For example, in the REQ_ADDR format, bits 21-12 (RA[21-12]) represent a page number, bits 21-14 (RA[21-14]) represent a page block, bits 13-12 (RA[13-12]) represent a cache entry in the block, and bits 11-2 (RA[11-2]) represent an offset in the page.
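These overlapping bit fields can be illustrated with a short sketch; the field positions follow the description above, while the function name and example values are hypothetical:

```python
def decode_req_addr(ra):
    """Extract the REQ_ADDR bit fields at the positions given above."""
    page_number = (ra >> 12) & 0x3FF  # bits 21-12: page number (10 bits)
    page_block  = (ra >> 14) & 0xFF   # bits 21-14: page block (8 bits)
    cache_entry = (ra >> 12) & 0x3    # bits 13-12: one of 4 cache entries
    offset      = (ra >> 2)  & 0x3FF  # bits 11-2: word offset in the page
    return page_number, page_block, cache_entry, offset
```

Note that the page number field is simply the page block concatenated with the cache-entry index, since bits 21-12 span both smaller fields.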
In Figure 6, three aperture-related base address registers 601, 603 and 605 are connected to a base address multiplexor device 607. Similarly, three aperture-related request address registers 609, 611 and 613 are connected to a corresponding request address multiplexor 615. Also, there are three aperture-related cache tag registers 617, 619 and 621 which are connected to a cache tag multiplexor 623. An aperture select circuit 625 provides an aperture select signal which is applied through a common connection 624 to the control terminals of the base address multiplexor 607, the request address multiplexor 615 and the cache tag multiplexor 623.
The output of the multiplexor 607 is applied to a request base register 627 and the output of multiplexor 615 is applied to a request address register 629. Similarly, the output from the multiplexor 623 is applied to a cache tag register 631. The register 631 is connected to a "B" input of a comparator circuit 633. Bits 21-14 from register 629 are applied to register 635, which is in turn applied to an "A" input of the comparator 633 and also to the 11-4 bit positions of another register 639. Bit positions 31-12 of register 639 receive an input from bit positions 31-12 of the request base address register 627, and bit positions 3-2 of the register 639 receive an input from a Page Table Load State Machine (S.M.) as hereinafter explained in connection with Figure 8.
The output of register 639 is applied to one input of a select address multiplexor 637. The output from the select address multiplexor 637 is applied to the current PCI or physical address register 643. Bit positions 11-2 of register 629 are applied to the bit position 11-2 input of register 641, the output of which is applied to a second input of the multiplexor 637. The output of multiplexor 637 is controlled by a control input from the output of the comparator 633. Thirty-two bit PCI READBACK information in register 645 is applied to a series of three register files 647, 649 and 651. Each of the register files includes a set of four registers. Each set of four registers comprises a cache for each of the three apertures PF, HXY0 and HXY1. Outputs from the register files 647, 649 and 651 are applied to a select cache multiplexor 653, the output of which is controlled by the 13-12 bit contents of the request address register 629. The output of the select cache multiplexor 653 is applied to register 655, the 31-12 bit output of which is applied to the 31-12 bit positions of register 641.
In operation, reference is made to the flow charts illustrated in Figure 7 and Figure 8. As used herein, a REQ_ADDR[21:2] command from the graphics engine represents a request for data from the physical address specified in bit position 31 through bit position 12 of the corresponding page table entry in system memory. A REQ_BASE[31:12] command from the graphics unit represents a requested address from a 4 Megabyte aligned virtual address region of system memory. A cache_tag[21:14] represents a 16K aligned address tag of the current four pages in the cache registers of the graphics unit. Cache_entry[x][31:12][y] commands represent the physical address of the designated page in cache, with the "x" value standing for the page number of the value in cache, and the "y" value standing for whether or not a page is present. For example, "cache_entry[2][31:12][1]" stands for the physical address specified in bit positions 31-12, with the "2" meaning that the third page (of pages "0", "1", "2" and "3") is designated, and the "1" in the end position indicating that a page is present. Also, in a 20-bit REQ_ADDR[21:2], bits 21 to 14 stand for the page block, bits 13 and 12 stand for the cache entry, and bits 11-2 stand for the offset from the beginning of the page.
In considering a typical data flow within the system, it is assumed that the graphics unit issues a REQ_ADDR[21:2] command to fetch data stored in memory for a graphics operation to be accomplished, such as the filling-in of a pixel on the display in accordance with the appropriate color information as stored in system memory. In that case a request REQ_ADDR[21:2] is made for a host memory address access 701. At that time a check is made 703 to determine if the requested address is currently already stored in the graphics cache registers since, if that is true, the requested page address information would be available in the graphics cache registers and an access to the page table copy in system memory would not be required. If the page address results in a cache hit, then data may be accessed in host memory directly. If there is a graphics cache miss, then the four page table addresses must be loaded into the cache and an access to the system memory is accomplished to obtain the data (a pixel in the present example) from system memory.
In the present example as hereinbefore noted, there is a cache tag register for each aperture PF, HXY0 and HXY1. If the requested address is not contained in the graphics cache registers, then the system memory is accessed and the method executes a cache-fill operation 705, which is explained in more detail with reference to Figure 8. If, however, there is a "hit", i.e. the cache tag bits [21-14] of the REQ_ADDR command match the corresponding bits of the physical addresses currently stored in the graphics cache registers, or, if not, after a cache-fill operation is completed, the process continues by indexing the 13-12 bit contents of the REQ_ADDR statement into cache 707. Thereafter, a check is made 709 of the page present bits of the cache entry to ensure that the appropriate page is present. If the appropriate page is not present 711, then a "page invalid" condition generates a "page fault" and the process ends 713. If, however, the appropriate page is present, then the physical address is synthesized 715 and provided to the PCI bus 105, and a second access to system memory is accomplished. If the page required by the REQ_ADDR[21:2] 629 is in the cache, then the physical address of the data word is built, that data word in system memory is accessed (read or write), and the process is completed 717. There will always be at least one access to system memory. The cache registers 647, 649 and 651 eliminate the need to read the page table on each access, which would otherwise require two memory accesses for each REQ_ADDR given.
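The hit/miss flow of Figure 7 can be modeled with a minimal sketch; the data structures and page-fault handling below are illustrative assumptions rather than the claimed hardware:

```python
# One aperture's cache: the tag of the currently cached block of four
# pages, and four physical page frames (None models a cleared present bit).
cache_tag = None
cache_entries = [None] * 4

def fill_cache(page_block, page_table):
    """Cache fill (Figure 8): load four physical page pointers from four
    consecutive page-table entries into the four cache entries."""
    global cache_tag
    for i in range(4):
        cache_entries[i] = page_table.get(page_block * 4 + i)
    cache_tag = page_block

def translate(req_addr, page_table):
    """Check the tag, fill on a miss, verify the page-present state, then
    synthesize the physical address to be driven onto the PCI bus."""
    page_block = (req_addr >> 14) & 0xFF  # REQ_ADDR bits 21-14
    entry = (req_addr >> 12) & 0x3        # REQ_ADDR bits 13-12
    offset = req_addr & 0xFFF             # byte offset within the page
    if cache_tag != page_block:           # cache miss: reload all four pages
        fill_cache(page_block, page_table)
    frame = cache_entries[entry]
    if frame is None:                     # page-present bit clear
        raise RuntimeError("page fault")
    return (frame << 12) | offset         # synthesized physical address
```

On a hit, only the final data access touches system memory; on a miss, the fill adds one burst of page-table reads but still avoids a separate table read per request.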
In Figure 8, there is shown the cache fill state machine flow chart. Whenever the REQ_ADDR is not present in the graphics cache registers, as determined at step 703 by comparator 633, the process seeks access to the PCI bus to access host memory. Until access is granted to the PCI bus, the method idles 801 and waits for such access. When there is a graphics cache register miss for a requested address and PCI access is granted 803, then the cache-fill process loads four physical page pointers from four consecutive addresses in the page table into the four cache entries. The process continues to load 805 data phase "0", which loads cache entry "0" and increments the PCI address to the next page table entry. That incrementing step loads the "3" bit position and the "2" bit position of the register 639 in Figure 6 such that those two bit positions are sequentially loaded with the "00", "01", "10" and "11" combinations, respectively, as the iteration "i" is sequenced from "0" through "3". A check is then made 807 to determine whether the data is valid and, if not, the load step 805 continues until the data is determined to be valid. At that time, each of the three graphics cache registers is loaded 809. When all of the cache registers are loaded and the data is valid 811, the process waits for a PCI idle state 813, and when a PCI idle state is detected, the process returns to the idle state 801 to await the next concurrence of a graphics cache miss and a PCI grant 803.
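The address stepping performed through register 639 during the fill can be sketched as follows; the function name is hypothetical, while the bit positions follow the description of Figure 6 above:

```python
def fill_addresses(req_base, page_block):
    """Build the four PCI addresses used to fetch consecutive 32-bit
    page-table entries: req_base supplies bits 31-12, the page block
    supplies bits 11-4, and the iteration i supplies bits 3-2,
    stepping through the "00", "01", "10" and "11" combinations."""
    return [(req_base << 12) | (page_block << 4) | (i << 2) for i in range(4)]
```

Each successive address differs by 4 bytes, i.e. by exactly one 32-bit page-table entry.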
The method and apparatus of the present invention have been described in connection with a preferred embodiment as disclosed herein. Although an embodiment of the present invention has been shown and described in detail herein, along with certain variants thereof, many other varied embodiments that incorporate the teachings of the invention may be easily constructed by those skilled in the art. Accordingly, the present invention is not intended to be limited to the specific form set forth herein; on the contrary, it is intended to cover such alternatives, modifications, and equivalents as can reasonably be included within the spirit and scope of the invention.