BACKGROUND
1. Field of the Invention
The present invention relates generally to computers and software, and in particular to abstracting software libraries for a variety of different parallel hardware platforms.
2. Description of the Related Art
Computers and other data processing devices typically have at least one control processor that is generally known as a central processing unit (CPU). Such computers and devices can also have other processors, such as graphics processing units (GPUs), that are used for specialized processing of various types. For example, GPUs may be designed to perform graphics processing operations. GPUs generally comprise multiple processing elements that are capable of executing the same instruction on parallel data streams. In general, a CPU functions as the host and may hand off specialized parallel tasks to other processors such as GPUs.
Several frameworks have been developed for heterogeneous computing platforms that have CPUs and GPUs. These frameworks include BrookGPU by Stanford University, CUDA by NVIDIA, and OpenCL™ by an industry consortium named Khronos Group. The OpenCL framework offers a C-like development environment in which users can create applications to run on various types of CPUs, GPUs, digital signal processors (DSPs), and other processors. OpenCL also provides a compiler and a runtime environment in which code can be compiled and executed within a heterogeneous computing system. When using OpenCL, developers can use a single, unified toolchain and language to target all of the processors currently in use. This is done by presenting the developer with an abstract platform model that conceptualizes all of these architectures in a similar way, as well as an execution model supporting data and task parallelism across heterogeneous architectures.
OpenCL allows any application to tap into the vast GPU computing power included in many computing platforms, power that was previously available only to graphics applications. Using OpenCL, it is possible to write programs that will run on any GPU for which the vendor has provided OpenCL drivers. When an OpenCL program is executed, a series of API calls configure the system for execution, an embedded just-in-time (JIT) compiler compiles the OpenCL code, and the runtime asynchronously coordinates execution between parallel kernels. Tasks may be offloaded from a host (e.g., CPU) to an accelerator device (e.g., GPU) in the same system.
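By way of non-limiting illustration, this conventional source-based flow may resemble the following sketch, which uses standard OpenCL host API calls (error handling is omitted, and the kernel shown is merely a placeholder):

    #include <CL/cl.h>

    int main(void)
    {
        /* Kernel source text is shipped and compiled at runtime, so it
           is visible to the end-user system. */
        const char *src =
            "__kernel void inc(__global int *x) {"
            "    x[get_global_id(0)] += 1;"
            "}";

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

        /* The runtime's embedded JIT compiler is invoked here. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

        clReleaseProgram(prog);
        clReleaseContext(ctx);
        return 0;
    }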
A typical OpenCL-based system may take source code and run it through a JIT compiler to generate executable code for a target GPU. Then, the executable code, or portions of the executable code, are sent to the target GPU and executed. However, this approach may introduce significant compilation delay at runtime, and it exposes the OpenCL source code. Therefore, there is a need in the art for OpenCL-based approaches for providing software libraries to an application within an OpenCL runtime environment without exposing the source code used to generate the libraries.
SUMMARY OF EMBODIMENTS
In one embodiment, source code and source libraries may go through several compilation stages from a high-level software language to an instruction set architecture (ISA) binary containing kernels that are executable on specific target hardware. In one embodiment, the high-level software language of the source code and libraries may be Open Computing Language (OpenCL). Each source library may include a plurality of kernels that may be invoked from a software application executing on a CPU and may be conveyed to a GPU for actual execution.
The library source code may be compiled into an intermediate representation prior to being conveyed to an end-user computing system. In one embodiment, the intermediate representation may be a low level virtual machine (LLVM) intermediate representation. The intermediate representation may be provided to end-user computing systems as part of a software installation package. At install-time, the LLVM file may be compiled for the specific target hardware of the given end-user computing system. The CPU or other host device in the given computing system may compile the LLVM file to generate an ISA binary for the hardware target, such as a GPU, within the system.
At runtime, the ISA binary may be opened via a software development kit (SDK), which may check for proper installation and may retrieve one or more specific kernels from the ISA binary. The kernels may then be stored in memory, and an executing application may deliver each kernel to a GPU for execution via the OpenCL runtime environment.
These and other features and advantages will become apparent to those of ordinary skill in the art in view of the following detailed descriptions of the approaches presented herein.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further advantages of the methods and mechanisms may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:
FIG. 1 is a block diagram of a computing system in accordance with one or more embodiments.
FIG. 2 is a block diagram of a distributed computing environment in accordance with one or more embodiments.
FIG. 3 is a block diagram of an OpenCL software environment in accordance with one or more embodiments.
FIG. 4 is a block diagram of an encrypted library in accordance with one or more embodiments.
FIG. 5 is a block diagram of one embodiment of a portion of another computing system.
FIG. 6 is a generalized flow diagram illustrating one embodiment of a method for providing a library within an OpenCL environment.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented herein. However, one having ordinary skill in the art should recognize that the various embodiments may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the approaches described herein. It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
This specification includes references to “one embodiment”. The appearance of the phrase “in one embodiment” in different contexts does not necessarily refer to the same embodiment. Particular features, structures, or characteristics may be combined in any suitable manner consistent with this disclosure. Furthermore, as used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.
Terminology. The following paragraphs provide definitions and/or context for terms found in this disclosure (including the appended claims):
“Comprising.” This term is open-ended. As used in the appended claims, this term does not foreclose additional structure or steps. Consider a claim that recites: “A system comprising a host processor . . . .” Such a claim does not foreclose the system from including additional components (e.g., a network interface, a memory).
“Configured To.” Various units, circuits, or other components may be described or claimed as “configured to” perform a task or tasks. In such contexts, “configured to” is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs the task or tasks during operation. As such, the unit/circuit/component can be said to be configured to perform the task even when the specified unit/circuit/component is not currently operational (e.g., is not on). The units/circuits/components used with the “configured to” language include hardware—for example, circuits, memory storing program instructions executable to implement the operation, etc. Reciting that a unit/circuit/component is “configured to” perform one or more tasks is expressly intended not to invoke 35 U.S.C. §112, sixth paragraph, for that unit/circuit/component. Additionally, “configured to” can include generic structure (e.g., generic circuitry) that is manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the task(s) at issue. “Configured to” may also include adapting a manufacturing process (e.g., a semiconductor fabrication facility) to fabricate devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks.
“First,” “Second,” etc. As used herein, these terms are used as labels for nouns that they precede, and do not imply any type of ordering (e.g., spatial, temporal, logical) unless explicitly defined as such. For example, in a system with four GPUs, the terms “first” and “second” GPUs can be used to refer to any two of the four GPUs.
“Based On.” As used herein, this term is used to describe one or more factors that affect a determination. This term does not foreclose additional factors that may affect a determination. That is, a determination may be based solely on those factors or based, at least in part, on those factors. Consider the phrase “determine A based on B.” While B may be a factor that affects the determination of A, such a phrase does not foreclose the determination of A from also being based on C. In other instances, A may be determined based solely on B.
Referring now to FIG. 1, a block diagram of a computing system 100 according to one embodiment is shown. Computing system 100 includes a CPU 102, a GPU 106, and may optionally include a coprocessor 108. In the embodiment illustrated in FIG. 1, CPU 102 and GPU 106 are included on separate integrated circuits (ICs) or packages. In other embodiments, however, CPU 102 and GPU 106, or the collective functionality thereof, may be included in a single IC or package. In one embodiment, GPU 106 may have a parallel architecture that supports executing data-parallel applications.
In addition, computing system 100 also includes a system memory 112 that may be accessed by CPU 102, GPU 106, and coprocessor 108. In various embodiments, computing system 100 may comprise a supercomputer, a desktop computer, a laptop computer, a video-game console, an embedded device, a handheld device (e.g., a mobile telephone, smart phone, MP3 player, a camera, a GPS device, or the like), or some other device that includes or is configured to include a GPU. Although not specifically illustrated in FIG. 1, computing system 100 may also include a display device (e.g., cathode-ray tube, liquid crystal display, plasma display, etc.) for displaying content (e.g., graphics, video, etc.) of computing system 100.
GPU 106 assists CPU 102 by performing certain special functions (such as graphics-processing tasks and data-parallel, general-compute tasks), usually faster than CPU 102 could perform them in software. Coprocessor 108 may also assist CPU 102 in performing various tasks. Coprocessor 108 may comprise, but is not limited to, a floating point coprocessor, a GPU, a video processing unit (VPU), a networking coprocessor, and other types of coprocessors and processors.
GPU 106 and coprocessor 108 may communicate with CPU 102 and system memory 112 over bus 114. Bus 114 may be any type of bus or communications fabric used in computer systems, including a peripheral component interconnect (PCI) bus, an accelerated graphics port (AGP) bus, a PCI Express (PCIe) bus, or another type of bus, whether presently available or developed in the future.
In addition to system memory 112, computing system 100 further includes local memory 104 and local memory 110. Local memory 104 is coupled to GPU 106 and may also be coupled to bus 114. Local memory 110 is coupled to coprocessor 108 and may also be coupled to bus 114. Local memories 104 and 110 are available to GPU 106 and coprocessor 108, respectively, in order to provide faster access to certain data (such as data that is frequently used) than would be possible if the data were stored in system memory 112.
Turning now to FIG. 2, a block diagram illustrating one embodiment of a distributed computing environment is shown. Host application 210 may execute on host device 208, which may include one or more CPUs and/or other types of processors (e.g., systems on chips (SoCs), graphics processing units (GPUs), field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs)). Host device 208 may be coupled to each of compute devices 206A-N via various types of connections, including direct connections, bus connections, local area network (LAN) connections, internet connections, and the like. In addition, one or more of compute devices 206A-N may be part of a cloud computing environment.
Compute devices 206A-N are representative of any number of computing systems and processing devices which may be coupled to host device 208. Each compute device 206A-N may include a plurality of compute units 202. Each compute unit 202 may represent any of various types of processors, such as GPUs, CPUs, FPGAs, and the like. Additionally, each compute unit 202 may include a plurality of processing elements 204A-N.
Host application 210 may monitor and control other programs running on compute devices 206A-N. The programs running on compute devices 206A-N may include OpenCL kernels. In one embodiment, host application 210 may execute within an OpenCL runtime environment and may monitor the kernels executing on compute devices 206A-N. As used herein, the term “kernel” may refer to a function declared in a program that executes on a target device (e.g., GPU) within an OpenCL framework. The source code for the kernel may be written in the OpenCL language and compiled in one or more steps to create an executable form of the kernel. In one embodiment, the kernels to be executed by a compute unit 202 of compute device 206 may be broken up into a plurality of workloads, and workloads may be issued to different processing elements 204A-N in parallel. In other embodiments, runtime environments other than OpenCL may be utilized by the distributed computing environment.
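By way of non-limiting illustration, a kernel may be declared in the OpenCL language as follows (a minimal sketch; the kernel name and arguments are merely illustrative):

    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c)
    {
        /* Each work-item computes one element; parallelism across the
           processing elements is implicit in the NDRange. */
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];
    }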
Referring now to FIG. 3, a block diagram illustrating one embodiment of an OpenCL software environment is shown. A software library specific to a certain type of processing (e.g., video editing, media processing, graphics processing) may be downloaded or included in an installation package for a computing system. The software library may be compiled from source code to a device-independent intermediate representation prior to being included in the installation package. In one embodiment, the intermediate representation (IR) may be a low-level virtual machine (LLVM) intermediate representation, such as LLVM IR 302. LLVM is an industry-standard, language-independent compiler framework that defines a common, low-level code representation for the transformation of source code. In other embodiments, other types of IRs may be utilized. Distributing LLVM IR 302 instead of the source code may prevent unintended access to or modification of the original source code.
LLVM IR 302 may be included in the installation package for various types of end-user computing systems. In one embodiment, at install-time, LLVM IR 302 may be compiled into an intermediate language (IL) 304. A compiler (not shown) may generate IL 304 from LLVM IR 302. IL 304 may include technical details that are specific to the target devices (e.g., GPUs 318), although IL 304 may not be executable on the target devices. In another embodiment, IL 304 may be provided as part of the installation package instead of LLVM IR 302.
Then, IL 304 may be compiled into the device-specific binary 306, which may be cached by CPU 316 or otherwise accessible for later use. The compiler used to generate binary 306 from IL 304 (and IL 304 from LLVM IR 302) may be provided to CPU 316 as part of a driver pack for GPUs 318. As used herein, the term “binary” may refer to a compiled, executable version of a library of kernels. Binary 306 may be targeted to a specific target device, and kernels may be retrieved from the binary and executed by the specific target device. The kernels from a binary compiled for a first target device may not be executable on a second target device. Binary 306 may also be referred to as an instruction set architecture (ISA) binary. In one embodiment, LLVM IR 302, IL 304, and binary 306 may be stored in a kernel database (KDB) file format. For example, file 302 may be marked as an LLVM IR version of a KDB file, file 304 may be an IL version of a KDB file, and file 306 may be a binary version of a KDB file.
The device-specific binary 306 may include a plurality of executable kernels. The kernels may already be in a compiled, executable form such that they may be transferred to any of GPUs 318 and executed without having to go through a just-in-time (JIT) compile stage. When a specific kernel is first accessed by software application 310, the specific kernel may be retrieved from binary 306 and stored in memory. Therefore, for future accesses of the same kernel, the kernel may be retrieved from memory instead of being retrieved from binary 306. In another embodiment, the kernel may be stored in memory within GPUs 318 so that the kernel can be quickly accessed the next time the kernel is executed.
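By way of non-limiting illustration, a precompiled kernel image retrieved from binary 306 may be handed to the OpenCL runtime with the standard clCreateProgramWithBinary call, avoiding a source-level JIT compile stage (a sketch with error handling omitted; the helper's name and the manner in which the image is obtained are merely illustrative):

    #include <CL/cl.h>

    cl_kernel load_prebuilt_kernel(cl_context ctx, cl_device_id dev,
                                   const unsigned char *image,
                                   size_t image_size,
                                   const char *kernel_name)
    {
        cl_int status, err;
        /* Create the program object directly from the device-specific
           binary image rather than from source text. */
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev,
                                                    &image_size, &image,
                                                    &status, &err);
        /* For an ISA-level binary, this build step finalizes the program
           for the device without source-level JIT compilation. */
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        return clCreateKernel(prog, kernel_name, &err);
    }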
The software development kit (SDK) library (.lib) file, SDK.lib 312, may be utilized by software application 310 to provide access to binary 306 via a dynamic-link library, SDK.dll 308. SDK.dll 308 may be utilized to access binary 306 from software application 310 at runtime, and SDK.dll 308 may be distributed to end-user computing systems along with LLVM IR 302. Software application 310 may utilize SDK.lib 312 to access binary 306 via SDK.dll 308 by making the appropriate API calls.
SDK.lib 312 may include a plurality of functions for accessing the kernels in binary 306. These functions may include an open function, a get program function, and a close function. The open function may open binary 306 and load a master index table from binary 306 into memory within CPU 316. The get program function may select a single kernel from the master index table and copy the kernel from binary 306 into CPU 316 memory. The close function may release resources used by the open function.
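By way of non-limiting illustration, the open, get program, and close functions may be exposed to software application 310 through an interface such as the following (the names and signatures below are illustrative assumptions, not an actual SDK.lib definition):

    #include <stddef.h>

    typedef struct kdb_binary kdb_binary_t;

    /* Open function: opens binary 306, verifies proper installation, and
       loads the master index table into memory within CPU 316. */
    kdb_binary_t *kdb_open(const char *path);

    /* Get program function: selects a single kernel from the master index
       table and copies it from binary 306 into CPU memory. */
    const void *kdb_get_program(kdb_binary_t *kdb, const char *kernel_name,
                                size_t *size_out);

    /* Close function: releases the resources acquired by kdb_open(). */
    void kdb_close(kdb_binary_t *kdb);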
In some embodiments, when the open function is called, software application 310 may determine if binary 306 has been compiled with the latest driver. If a new driver has been installed by CPU 316 and if binary 306 was compiled by a compiler from a previous driver, then the original LLVM IR 302 may be recompiled with the new compiler to create a new binary 306. In one embodiment, only the individual kernel that has been invoked may be recompiled. In another embodiment, the entire library of kernels may be recompiled. In a further embodiment, the recompilation may not occur at runtime. Instead, an installer may recognize all of the binaries stored in CPU 316, and when a new driver is installed, the installer may recompile LLVM IR 302 and any other LLVM IRs in the background when CPU 316 is not busy.
In one embodiment, CPU 316 may operate an OpenCL runtime environment. Software application 310 may include an OpenCL application programming interface (API) for accessing the OpenCL runtime environment. In other embodiments, CPU 316 may operate other types of runtime environments. For example, in another embodiment, a DirectCompute runtime environment may be utilized.
Turning now to FIG. 4, a block diagram of one embodiment of an encrypted library is shown. Source code 402 may be compiled to generate LLVM IR 404. LLVM IR 404 may be used to generate encrypted LLVM IR 406, which may be conveyed to CPU 416. Distributing encrypted LLVM IR 406 to end-users may provide extra protection of source code 402 and may prevent an unauthorized user from reverse-engineering LLVM IR 404 to generate an approximation of source code 402. Creating and distributing encrypted LLVM IR 406 may be an option that is available for certain libraries and certain installation packages. For example, the software developer of source code 402 may decide to use encryption to provide extra protection for their source code. In other embodiments, an IL version of source code 402 may be provided to end-users, and in these embodiments, the IL file may be encrypted prior to being delivered to target computing systems.
When encryption is utilized, compiler 408 may include an embedded decrypter 410, which is configured to decrypt encrypted LLVM IR files. Compiler 408 may decrypt encrypted LLVM IR 406 and then perform the compilation to create unencrypted binary 414, which may be stored in memory 412. In another embodiment, unencrypted binary 414 may be stored in another memory (not shown) external to CPU 416. In some embodiments, compiler 408 may generate an IL representation (not shown) from LLVM IR 406 and then may generate unencrypted binary 414 from the IL. In various embodiments, a flag may be set in encrypted LLVM IR 406 to indicate that it is encrypted.
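By way of non-limiting illustration, the flag check and decryption may proceed as in the following sketch (the container layout, flag value, and decryption stub are illustrative assumptions only):

    #include <stddef.h>
    #include <stdint.h>

    #define KDB_FLAG_ENCRYPTED 0x1 /* hypothetical header flag */

    typedef struct {
        uint32_t flags;     /* e.g., KDB_FLAG_ENCRYPTED */
        uint8_t  payload[]; /* LLVM IR, possibly encrypted */
    } kdb_header_t;

    /* Placeholder standing in for decrypter 410; a real implementation
       would apply an actual cipher. */
    static void kdb_decrypt(uint8_t *data, size_t size)
    {
        for (size_t i = 0; i < size; i++)
            data[i] ^= 0x5A; /* illustrative transformation only */
    }

    /* Returns plain LLVM IR ready for compilation, decrypting in place
       only when the header flag indicates encryption. */
    static const uint8_t *prepare_ir(kdb_header_t *hdr, size_t payload_size)
    {
        if (hdr->flags & KDB_FLAG_ENCRYPTED)
            kdb_decrypt(hdr->payload, payload_size);
        return hdr->payload;
    }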
Referring now to FIG. 5, a block diagram of one embodiment of a portion of another computing system is shown. Source code 502 may represent any number of libraries and kernels which may be utilized by system 500. In one embodiment, source code 502 may be compiled into LLVM IR 504. LLVM IR 504 may be the same for GPUs 510A-N. In one embodiment, LLVM IR 504 may be compiled by separate compilers into intermediate language (IL) representations 506A-N. A first compiler (not shown) executing on CPU 512 may generate IL 506A, and then IL 506A may be compiled into binary 508A. Binary 508A may be targeted to GPU 510A, which may have a first type of micro-architecture. Similarly, a second compiler (not shown) executing on CPU 512 may generate IL 506N, and then IL 506N may be compiled into binary 508N. Binary 508N may be targeted to GPU 510N, which may have a second type of micro-architecture different than the first type of micro-architecture of GPU 510A.
Binaries 508A-N are representative of any number of binaries that may be generated, and GPUs 510A-N are representative of any number of GPUs that may be included in the computing system 500. Binaries 508A-N may also include any number of kernels, and different kernels from source code 502 may be included within different binaries. For example, source code 502 may include a plurality of kernels. A first kernel may be intended for execution on GPU 510A, and so the first kernel may be compiled into binary 508A which targets GPU 510A. A second kernel from source code 502 may be intended for execution on GPU 510N, and so the second kernel may be compiled into binary 508N which targets GPU 510N. This process may be repeated such that any number of kernels may be included within binary 508A and any number of kernels may be included within binary 508N. Some kernels from source code 502 may be compiled and included into both binaries, some kernels may be compiled into only binary 508A, other kernels may be compiled into only binary 508N, and other kernels may not be included into either binary 508A or binary 508N. This process may be repeated for any number of binaries, and each binary may contain a subset or the entirety of kernels originating from source code 502. In other embodiments, other types of devices (e.g., FPGAs, ASICs) may be utilized within computing system 500 and may be targeted by one or more of binaries 508A-N.
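By way of non-limiting illustration, a host program may select which of binaries 508A-N to load by querying each device's identity through the standard OpenCL clGetDeviceInfo call (the device-name matching and file names below are illustrative assumptions):

    #include <CL/cl.h>
    #include <string.h>

    const char *binary_path_for_device(cl_device_id dev)
    {
        char name[256];
        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
        /* Map the reported device name to the binary compiled for that
           micro-architecture; the names here are placeholders. */
        if (strstr(name, "DEVICE_TYPE_A") != NULL)
            return "library_508A.kdb";
        return "library_508N.kdb";
    }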
Turning now toFIG. 6, one embodiment of a method for providing a library within an OpenCL environment is shown. For purposes of discussion, the steps in this embodiment are shown in sequential order. It should be noted that in various embodiments of the method described below, one or more of the elements described may be performed concurrently, in a different order than shown, or may be omitted entirely. Other additional elements may also be performed as desired.
Method 600 may start in block 605, and then the source code of a library may be compiled into an intermediate representation (IR) (block 610). In one embodiment, the source code may be written in OpenCL. In other embodiments, the source code may be written in other languages (e.g., C, C++, Fortran). In one embodiment, the IR may be an LLVM intermediate representation. In other embodiments, other IRs may be utilized. Next, the IR may be conveyed to a computing system (block 620). The computing system may include a plurality of processors, including one or more CPUs and one or more GPUs. The computing system may download the IR, the IR may be part of an installation software package, or any of various other methods for conveying the IR to the computing system may be utilized.
After block 620, the IR may be received by a host processor of the computing system (block 630). In one embodiment, the host processor may be a CPU. In other embodiments, the host processor may be a digital signal processor (DSP), system on chip (SoC), microprocessor, GPU, or the like. Then, the IR may be compiled into a binary by a compiler executing on the CPU (block 640). The binary may be targeted to a specific target processor (e.g., GPU, FPGA) within the computing system. Alternatively, the binary may be targeted to a device or processor external to the computing system. The binary may include a plurality of kernels, wherein each of the kernels is directly executable on the specific target processor. In some embodiments, the kernels may be functions that take advantage of the parallel processing ability of a GPU or other device with a parallel architecture. The binary may be stored within CPU local memory, system memory, or in another storage location.
In one embodiment, the CPU may execute a software application (block 650), and the software application may interact with an OpenCL runtime environment to schedule specific tasks to be performed by one or more target processors. To perform these tasks, the software application may invoke calls to one or more functions corresponding to kernels from the binary. When the function call executes, a request for the kernel may be generated by the application (conditional block 660). Responsive to generating a request for a kernel, the application may invoke one or more API calls to retrieve the kernel from the binary (block 670).
If a request for a kernel is not generated (conditional block 660), then the software application may continue with its execution and may be ready to respond when a request for a kernel is generated. Then, after the kernel has been retrieved from the binary (block 670), the kernel may be conveyed to the specific target processor (block 680). The kernel may be conveyed to the specific target processor in a variety of manners, including as a string or in a buffer. Then, the kernel may be executed by the specific target processor (block 690). After block 690, the software application may continue to be executed on the CPU until another request for a kernel is generated (conditional block 660). Steps 610-640 may be repeated a plurality of times for a plurality of libraries that are utilized by the computing system. It is noted that while kernels are commonly executed on highly parallelized processors such as GPUs, kernels may also be executed on CPUs or on a combination of GPUs, CPUs, and other devices in a distributed manner.
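By way of non-limiting illustration, conveying a retrieved kernel to the target processor and executing it (blocks 680 and 690) may resemble the following sketch, which uses standard OpenCL host API calls (buffer creation and error handling are omitted):

    #include <CL/cl.h>

    void run_kernel(cl_command_queue queue, cl_kernel kernel,
                    cl_mem buffer, size_t global_size)
    {
        /* Bind the kernel's argument, then enqueue a one-dimensional
           NDRange so work-items execute in parallel on the target. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                               NULL, 0, NULL, NULL);
        clFinish(queue); /* wait for the kernel to complete */
    }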
It is noted that the above-described embodiments may comprise software. In such an embodiment, program instructions and/or a database that represent the described methods and mechanisms may be stored on a non-transitory computer readable storage medium. The program instructions may include machine readable instructions for execution by a machine, a processor, and/or any general purpose computer for use with or by any non-volatile memory device. Suitable processors include, by way of example, both general and special purpose processors.
Generally speaking, a non-transitory computer readable storage medium may include any storage media accessible by a computer during use to provide instructions and/or data to the computer. For example, a non-transitory computer readable storage medium may include storage media such as magnetic or optical media, e.g., disk (fixed or removable), tape, CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, or Blu-Ray. Storage media may further include volatile or non-volatile memory media such as RAM (e.g., synchronous dynamic RAM (SDRAM), double data rate (DDR, DDR2, DDR3, etc.) SDRAM, low-power DDR (LPDDR2, etc.) SDRAM, Rambus DRAM (RDRAM), static RAM (SRAM)), ROM, non-volatile memory (e.g., Flash memory) accessible via a peripheral interface such as the USB interface, etc. Storage media may include micro-electro-mechanical systems (MEMS), as well as storage media accessible via a communication medium such as a network and/or a wireless link.
In other embodiments, the program instructions that represent the described methods and mechanisms may be a behavioral-level description or register-transfer level (RTL) description of hardware functionality in a hardware design language (HDL) such as Verilog or VHDL. The description may be read by a synthesis tool which may synthesize the description to produce a netlist comprising a list of gates from a synthesis library. The netlist comprises a set of gates which also represent the functionality of the hardware comprising the system. The netlist may then be placed and routed to produce a data set describing geometric shapes to be applied to masks. The masks may then be used in various semiconductor fabrication steps to produce a semiconductor circuit or circuits corresponding to the system. Alternatively, the database on the computer accessible storage medium may be the netlist (with or without the synthesis library) or the data set, as desired. While a computer accessible storage medium may carry a representation of a system, other embodiments may carry a representation of any portion of a system, as desired, including an IC, any set of programs (e.g., API, DLL, compiler) or portions of programs.
Types of hardware components, processors, or machines which may be used by or in conjunction with the present invention include ASICs, FPGAs, microprocessors, or any integrated circuit. Such processors may be manufactured by configuring a manufacturing process using the results of processed HDL instructions (such instructions capable of being stored on a computer readable medium). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the methods and mechanisms described herein.
Although the features and elements are described in the example embodiments in particular combinations, each feature or element can be used alone without the other features and elements of the example embodiments or in various combinations with or without other features and elements. It should also be emphasized that the above-described embodiments are only non-limiting examples of implementations. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.