Offloading Design & Internals¶
Introduction¶
This document describes the Clang driver and code generation steps for creatingoffloading applications. Clang supports offloading to various architecturesusing programming models like CUDA, HIP, and OpenMP. The purpose of thisdocument is to illustrate the steps necessary to create an offloadingapplication using Clang.
OpenMP Offloading¶
Clang supports OpenMP target offloading to several different architectures suchas NVPTX, AMDGPU, X86_64, Arm, and PowerPC. Offloading code is generated byClang and then executed using thelibomptarget runtime and the associatedplugin for the target architecture, e.g.libomptarget.rtl.cuda. This sectiondescribes the steps necessary to create a functioning device image that can beloaded by the OpenMP runtime. More information on the OpenMP runtimes can befound at theOpenMP documentation page.
Offloading Overview¶
The goal of offloading compilation is to create an executable device image thatcan be run on the target device. OpenMP offloading creates executable images bycompiling the input file for both the host and the target device. The outputfrom the device phase then needs to be embedded into the host to create a fatobject. A special tool then needs to extract the device code from the fatobjects, run the device linking step, and embed the final image in a symbol thehost runtime library can use to register the library and access the symbols onthe device.
Compilation Process¶
The compiler performs the following high-level actions to generate OpenMPoffloading code:
Compile the input file for the host to produce a bitcode file. Lower
#pragmaomptargetdeclarations tooffloading entries and create metadata to indicate which entries are on the device.Compile the input file for the targetdevice usingtheoffloading entry metadata createdby the host.
Link the OpenMP device runtime library and run the backend to create a deviceobject file.
Run the backend on the host bitcode file and create afat object file using the device object file.
Pass the fat object file to thelinker wrapper tooland extract the device objects. Run the device linking action on the extractedobjects.
Wrap thedevice imagesandoffload entries in a symbol thatcan be accessed by the host.
Add thewrapped binary to the linker input andrun the host linking action. Link with
libomptargetto register andexecute the images.
Generating Offloading Entries¶
The first step in compilation is to generate offloading entries for the host.This information is used to identify function kernels or global values that willbe provided by the device. Blocks contained in a#pragmaomptarget orsymbols inside a#pragmaompdeclaretarget directive will have offloadingentries generated. The following table shows theoffload entry structure.
__tgt_offload_entry Structure¶ Type
Identifier
Description
void*
addr
Address of global symbol within device image (function or global)
char*
name
Name of the symbol
size_t
size
Size of the entry info (0 if it is a function)
int32_t
flags
Flags associated with the entry (seeTarget Region Entry Flags)
int32_t
reserved
Reserved, to be used by the runtime library.
The address of the global symbol will be set to the device pointer value by theruntime once the device image is loaded. The flags are set to indicate thehandling required for the offloading entry. If the offloading entry is an entryto a target region it can have one of the followingentry flags.
Target Region Entry Flags¶ Name
Value
Description
OMPTargetRegionEntryTargetRegion
0x00
Mark the entry as generic target region
OMPTargetRegionEntryCtor
0x02
Mark the entry as a global constructor
OMPTargetRegionEntryDtor
0x04
Mark the entry as a global destructor
If the offloading entry is a global variable, indicated by a non-zero size, itwill instead have one of the followingglobal flags.
Target Region Global¶ Name
Value
Description
OMPTargetGlobalVarEntryTo
0x00
Mark the entry as a ‘to’ attribute (w.r.t. the to clause)
OMPTargetGlobalVarEntryLink
0x01
Mark the entry as a ‘link’ attribute (w.r.t. the link clause)
The target offload entries are used by the runtime to access the device kernelsand globals that will be provided by the final device image. Each offloadingentry is set to use theomp_offloading_entries section. When the finalapplication is created the linker will provide the__start_omp_offloading_entries and__stop_omp_offloading_entries symbolswhich are used to create thefinal image.
This information is used by the device compilation stage to determine whichsymbols need to be exported from the device. We use theomp_offload.infometadata node to pass this information device compilation stage.
Accessing Entries on the Device¶
Accessing the entries in the device is done using the address field in theoffload entry. The runtime will setthe address to the pointer associated with the device image during runtimeinitialization. This is used to call the corresponding kernel function whenentering a#pragmaomptarget region. For variables, the runtime maintains atable mapping host pointers to device pointers. Global variables inside a#pragmaomptargetdeclare directive are first initialized to the host’saddress. Once the device address is initialized we insert it into the table tomap the host address to the device address.
Debugging Information¶
We generate structures to hold debugging information that is passed tolibomptarget. This allows the front-end to generate information the runtimelibrary uses for more informative error messages. This is done using thestandardidentifier structure used inlibomp andlibomptarget. This is used to pass information and sourcelocations to the runtime.
ident_t Structure¶ Type
Identifier
Description
int32_t
reserved
Reserved, to be used by the runtime library.
int32_t
flags
Flags used to indicate some features, mostly unused.
int32_t
reserved
Reserved, to be used by the runtime library.
int32_t
reserved
Reserved, to be used by the runtime library.
char*
psource
Program source information, stored as “;filename;function;line;column;;\0”
If debugging information is enabled, we will also create strings to indicate thenames and declarations of variables mapped in target regions. These have thesame format as the source location in theidentifier structure, but the function name is replaced with the variablename.
Offload Device Compilation¶
The input file is compiled for each active device toolchain. The devicecompilation stage is performed differently from the host stage. Namely, we donot generate any offloading entries. This is set by passing the-fopenmp-is-target-device flag to the front-end. We use the host bitcode todetermine which symbols to export from the device. The bitcode file is passed infrom the previous stage using the-fopenmp-host-ir-file-path flag.Compilation is otherwise performed as it would be for any other target triple.
When compiling for the OpenMP device, we set the visibility of all devicesymbols to beprotected by default. This improves performance and prevents aclass of errors where a symbol in the target device could preempt a hostlibrary.
The OpenMP runtime library is linked in during compilation to provide theimplementations for standard OpenMP functionality. For GPU targets this is doneby linking in a special bitcode library during compilation, (e.g.libomptarget-nvptx64-sm_70.bc) using the-mlink-builtin-bitcode flag.Other device libraries, such as CUDA’s libdevice, are also linked this way. Ifthe target is a standard architecture with an existinglibompimplementation, that will be linked instead. Finally, device tools are used tocreate a relocatable device object file that can be embedded in the host.
Creating Fat Objects¶
A fat binary is a binary file that contains information intended for anotherdevice. We create a fat object by embedding the output of the device compilationstage into the host as a named section. The output from the device compilationis passed to the host backend using the-fembed-offload-object flag. Thisembeds the device image into the.llvm.offloading section using a specialbinary format that behaves like a string map. This binary format is used tobundle metadata about the image so the linker can associate the proper devicelinking action with the image. Each device image will start with the magic bytes0x10FF10AD.
@llvm.embedded.object=privateconstant[1xi8]c"\00",section".llvm.offloading"
The device code will then be placed in the corresponding section one the backendis run on the host, creating a fat object. Using fat objects allows us to treatoffloading objects as standard host objects. The final object file shouldcontain the followingoffloading sections. Wewill use this information whenLinking Target Device Code.
Offloading Sections¶ Section
Description
omp_offloading_entries
Offloading entry information (see__tgt_offload_entry Structure)
.llvm.offloading
Embedded device object file for the target device and architecture
Linking Target Device Code¶
Objects containingOffloading Sections require special handling tocreate an executable device image. This is done using a Clang tool, seeClang Linker Wrapper for more information. This tool works as a wrapperover the host linking job. It scans the input object files for the offloadingsection.llvm.offloading. The device files stored in this section are thenextracted and passed to the appropriate linking job. The linked device image isthenwrapped to create the symbols used to loadthe device image and link it with the host.
The linker wrapper tool supports linking bitcode files through link timeoptimization (LTO). This is used whenever the object files embedded in the hostcontain LLVM bitcode. Bitcode will be embedded for architectures that do notsupport a relocatable object format, such as AMDGPU or SPIR-V, or if the userrequested it using the-foffload-lto flag.
Device Binary Wrapping¶
Various structures and functions are used to create the information necessary tooffload code on the device. We use thelinked device executable with the corresponding offloading entries to create the symbolsnecessary to load and execute the device image.
Structure Types¶
Several different structures are used to store offloading information. Thedevice image structure stores a singlelinked device image and its associated offloading entries. The offloadingentries are stored using the__start_omp_offloading_entries and__stop_omp_offloading_entries symbols generated by the linker using the__tgt_offload_entry Structure.
__tgt_device_image Structure¶ Type
Identifier
Description
void*
ImageStart
Pointer to the target code start
void*
ImageEnd
Pointer to the target code end
__tgt_offload_entry*
EntriesBegin
Begin of table with all target entries
__tgt_offload_entry*
EntriesEnd
End of table (non inclusive)
The targettarget binary descriptor isused to store all binary images and offloading entries in an array.
__tgt_bin_desc Structure¶ Type
Identifier
Description
int32_t
NumDeviceImages
Number of device types supported
__tgt_device_image*
DeviceImages
Array of device images (1 per dev. type)
__tgt_offload_entry*
HostEntriesBegin
Begin of table with all host entries
__tgt_offload_entry*
HostEntriesEnd
End of table (non inclusive)
Global Variables¶
Global Variables lists various global variables, along with theirtype and their explicit ELF sections, which are used to store device images andrelated symbols.
Global Variables¶ Variable
Type
ELF Section
Description
__start_omp_offloading_entries
__tgt_offload_entry
.omp_offloading_entries
Begin symbol for the offload entries table.
__stop_omp_offloading_entries
__tgt_offload_entry
.omp_offloading_entries
End symbol for the offload entries table.
__dummy.omp_offloading.entry
__tgt_offload_entry
.omp_offloading_entries
Dummy zero-sized object in the offload entriessection to force linker to define begin/endsymbols defined above.
.omp_offloading.device_image
__tgt_device_image
.omp_offloading_entries
ELF device code object of the first image.
.omp_offloading.device_image.N
__tgt_device_image
.omp_offloading_entries
ELF device code object of the (N+1)th image.
.omp_offloading.device_images
__tgt_device_image
.omp_offloading_entries
Array of images.
.omp_offloading.descriptor
__tgt_bin_desc
.omp_offloading_entries
Binary descriptor object (seeBinary Descriptor for Device Images)
Binary Descriptor for Device Images¶
This object is passed to the offloading runtime at program startup and itdescribes all device images available in the executable or shared library. Itis defined as follows:
__attribute__((visibility("hidden")))extern__tgt_offload_entry*__start_omp_offloading_entries;__attribute__((visibility("hidden")))extern__tgt_offload_entry*__stop_omp_offloading_entries;staticconstcharImage0[]={<Bufs.front()contents>};...staticconstcharImageN[]={<Bufs.back()contents>};staticconst__tgt_device_imageImages[]={{Image0,/*ImageStart*/Image0+sizeof(Image0),/*ImageEnd*/__start_omp_offloading_entries,/*EntriesBegin*/__stop_omp_offloading_entries/*EntriesEnd*/},...{ImageN,/*ImageStart*/ImageN+sizeof(ImageN),/*ImageEnd*/__start_omp_offloading_entries,/*EntriesBegin*/__stop_omp_offloading_entries/*EntriesEnd*/}};staticconst__tgt_bin_descBinDesc={sizeof(Images)/sizeof(Images[0]),/*NumDeviceImages*/Images,/*DeviceImages*/__start_omp_offloading_entries,/*HostEntriesBegin*/__stop_omp_offloading_entries/*HostEntriesEnd*/};
Global Constructor and Destructor¶
The global constructor (.omp_offloading.descriptor_reg()) registers thedevice images with the runtime by calling the__tgt_register_lib() runtimefunction. The constructor is explicitly defined in.text.startup section andis run once when the program starts. Similarly, the global destructor(.omp_offloading.descriptor_unreg()) calls__tgt_unregister_lib() forthe destructor and is also defined in.text.startup section and run when theprogram exits.
Offloading Example¶
This section contains a simple example of generating offloading code usingOpenMP offloading. We will use a simpleZAXPY BLAS routine.
#include<complex>usingcomplex=std::complex<double>;voidzaxpy(complex*X,complex*Y,complexD,std::size_tN){#pragma omp target teams distribute parallel forfor(std::size_ti=0;i<N;++i)Y[i]=D*X[i]+Y[i];}intmain(){conststd::size_tN=1024;complexX[N],Y[N],D;#pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N])zaxpy(X,Y,D,N);}
This code is compiled using the following Clang flags.
$clang++-fopenmp-fopenmp-targets=nvptx64-O3zaxpy.cpp-c
The output section in the object file can be seen using thereadelf utility.The.llvm.offloading section has theSHF_EXCLUDE flag so it will beremoved from the final executable or shared library by the linker.
$ llvm-readelf -WS zaxpy.oSection Headers:[Nr] Name Type Address Off Size ES Flg Lk Inf Al[11] omp_offloading_entries PROGBITS 0000000000000000 0001f0 000040 00 A 0 0 1[12] .llvm.offloading PROGBITS 0000000000000000 000260 030950 00 E 0 0 8
Compiling this file again will invoke theclang-linker-wrapper utility toextract and link the device code stored at the section named.llvm.offloading and then use entries stored inthe section namedomp_offloading_entries to create the symbols necessary forlibomptarget to register the device image and call the entry function.
$clang++-fopenmp-fopenmp-targets=nvptx64zaxpy.o-ozaxpy$./zaxpy
We can see the steps created by clang to generate the offloading code using the-ccc-print-phases option in Clang. This matches the description inOffloading Overview.
$clang++-fopenmp-fopenmp-targets=nvptx64-ccc-print-phaseszaxpy.cpp#"x86_64-unknown-linux-gnu"-"clang",inputs:["zaxpy.cpp"],output:"/tmp/zaxpy-host.bc"#"nvptx64-nvidia-cuda"-"clang",inputs:["zaxpy.cpp","/tmp/zaxpy-e6a41b.bc"],output:"/tmp/zaxpy-07f434.s"#"nvptx64-nvidia-cuda"-"NVPTX::Assembler",inputs:["/tmp/zaxpy-07f434.s"],output:"/tmp/zaxpy-0af7b7.o"#"x86_64-unknown-linux-gnu"-"clang",inputs:["/tmp/zaxpy-e6a41b.bc","/tmp/zaxpy-0af7b7.o"],output:"/tmp/zaxpy-416cad.o"#"x86_64-unknown-linux-gnu"-"Offload::Linker",inputs:["/tmp/zaxpy-416cad.o"],output:"a.out"
Relocatable Linking¶
The offloading compilation pipeline normally will defer the final device linkingand runtime registration until theclang-linker-wrapper is run to create theexecutable. This is the standard behaviour when compiling for OpenMP offloadingor CUDA and HIP in-fgpu-rdc mode. However, there are some cases where theuser may wish to perform this device handling prematurely. This is described inthelinker wrapper documentation.
Effectively, this allows the user to handle offloading specific linking ahead oftime when shipping objects or static libraries. This can be thought of asperforming a standard-fno-gpu-rdc compilation on a subset of object files.This can be useful to reduce link time, prevent users from interacting with thelibrary’s device code, or for shipping libraries to incompatible compilers.
Normally, if a relocatable link is done usingclang-r it will simply mergethe.llvm.offloading sections which will then be linked later when theexecutable is created. However, if the-r flag is used with the offloadingtoolchain, it will perform the device linking and registration phases and thenmerge the registration code into the final relocatable object file.
The following example shows how using the relocatable link with the offloadingpipeline can create a static library with offloading code that can beredistributed without requiring any additional handling.
$clang++-fopenmp-fopenmp-targets=nvptx64foo.cpp-c$clang++-lomptarget.devicertl--offload-link-rfoo.o-omerged.o$llvm-arrcslibfoo.amerged.o#g++app.cpp-L.-lfoo