
Code compilation method, electronic device and storage medium

Info

Publication number
CN120122957A
Authority
CN
China
Prior art keywords
thread local
local storage
variable
target thread
storage variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202510534557.1A
Other languages
Chinese (zh)
Other versions
CN120122957B (en)
Inventor
Name withheld upon request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd
Priority to CN202510534557.1A
Publication of CN120122957A
Application granted
Publication of CN120122957B
Legal status: Active
Anticipated expiration


Abstract

Embodiments of the present disclosure provide a code compiling method, an electronic device, and a storage medium. The code compiling method is used for compiling code that includes at least one thread local storage variable, and comprises: in response to the existence of a target thread local storage variable, allocating the at least one target thread local storage variable to a thread local memory based on the attributes of the at least one target thread local storage variable to obtain a thread local memory address corresponding to each target thread local storage variable, where a target thread local storage variable is a thread local storage variable used by a first function; updating an intermediate representation corresponding to the code based on the thread local memory address corresponding to each target thread local storage variable; and generating machine code corresponding to the code based on the updated intermediate representation. The code compiling method provides thread local storage model support under a parallel computing architecture and solves the problem that traditional thread local storage implementations cannot be directly applied to GPU architectures.

Description

Code compiling method, electronic device and storage medium
Technical Field
Embodiments of the present disclosure relate to a code compiling method, an electronic device, and a storage medium.
Background
The Low-Level Virtual Machine (LLVM) is an open-source compiler framework designed as a set of modular, reusable compiler and toolchain technologies. LLVM supports a thread local storage (Thread Local Storage, TLS) model, in which thread local storage variables can be defined so that each thread has an independent copy of each variable, thereby avoiding inter-thread data races.
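As a brief illustration (a minimal sketch, not taken from the patent text), the per-thread-copy semantics of thread local storage in standard C++ look like the following:

#include <iostream>
#include <thread>

thread_local int counter = 0; // each thread gets its own independent copy

void work(const char* name) {
    for (int i = 0; i < 3; ++i)
        ++counter; // no data race: each thread increments its own copy
    std::cout << name << " counter = " << counter << std::endl; // prints 3 in each thread
}

int main() {
    std::thread t1(work, "t1");
    std::thread t2(work, "t2");
    t1.join();
    t2.join();
    return 0;
}

Each thread observes counter = 3 regardless of the other thread's activity, which is exactly the isolation property the TLS model provides.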
Disclosure of Invention
The embodiments of the present disclosure provide a code compiling method, where the code comprises at least one thread local storage variable. The code compiling method comprises: in response to the existence of a target thread local storage variable, allocating the at least one target thread local storage variable to a thread local memory based on the attributes of the at least one target thread local storage variable to obtain a thread local memory address corresponding to each target thread local storage variable, where a target thread local storage variable is a thread local storage variable, among the at least one thread local storage variable, that is used by a first function; updating an intermediate representation corresponding to the code based on the thread local memory address corresponding to each target thread local storage variable; and generating machine code corresponding to the code based on the updated intermediate representation.
The code compiling method provided by at least one embodiment of the present disclosure further includes inserting an instruction for initializing a target thread local storage variable used by each first function into an intermediate representation corresponding to the code.
The code compiling method provided by at least one embodiment of the present disclosure further includes determining, for each target thread local storage variable, a first function using the target thread local storage variable to construct a variable-function mapping relationship.
In a code compiling method provided in at least one embodiment of the present disclosure, the first function directly uses the target thread local storage variable, or uses the target thread local storage variable by calling a second function.
In the code compiling method provided in at least one embodiment of the present disclosure, the first function is a function called by a host side and executed by a device side, and the second function is a function called by the device side and executed by the device side.
In the code compiling method provided by at least one embodiment of the present disclosure, the allocating the at least one target thread local storage variable to the thread local memory based on the attribute of the at least one target thread local storage variable includes determining a variable allocation order based on the attribute of the at least one target thread local storage variable, and sequentially allocating the at least one target thread local storage variable to the thread local memory according to the variable allocation order.
In the code compiling method provided in at least one embodiment of the present disclosure, the attribute of the target thread local storage variable includes at least one of a first function number corresponding to the target thread local storage variable, an alignment parameter of the target thread local storage variable, a size of the target thread local storage variable, and a defined order of the target thread local storage variable.
In the code compiling method provided in at least one embodiment of the present disclosure, the determining a variable allocation order based on the attribute of the at least one target thread local storage variable includes arranging the at least one target thread local storage variable in descending order according to a first function number corresponding to the at least one target thread local storage variable, arranging the at least one target thread local storage variable in descending order according to an alignment parameter of the at least one target thread local storage variable, arranging the at least one target thread local storage variable in ascending order according to a size of the at least one target thread local storage variable, or arranging the at least one target thread local storage variable in a definition order of the at least one target thread local storage variable.
In the code compiling method provided by at least one embodiment of the present disclosure, determining a variable allocation order based on the attributes of the at least one target thread local storage variable includes: arranging the at least one target thread local storage variable in descending order according to the first function number corresponding to the at least one target thread local storage variable; in response to the existence, in the at least one target thread local storage variable, of a plurality of first thread local storage variables with the same corresponding first function number, arranging the plurality of first thread local storage variables in descending order according to the alignment parameter of the target thread local storage variable; in response to the existence, in the plurality of first thread local storage variables, of a plurality of second thread local storage variables with the same alignment parameter, arranging the plurality of second thread local storage variables in ascending order according to the size of the target thread local storage variable; and in response to the existence, in the plurality of second thread local storage variables, of a plurality of third thread local storage variables with the same size, arranging the plurality of third thread local storage variables according to the definition order of the target thread local storage variables.
In the code compiling method provided in at least one embodiment of the present disclosure, the allocating the at least one target thread local storage variable to the thread local memory based on the attribute of the at least one target thread local storage variable further includes, for a target thread local storage variable corresponding to a plurality of first functions, taking a maximum value of allocated offsets corresponding to the plurality of first functions as an allocation starting point of the target thread local storage variable in the thread local memory.
In the code compiling method provided in at least one embodiment of the present disclosure, the updating of the intermediate representation corresponding to the code based on the thread local memory address corresponding to each target thread local storage variable includes: for each target thread local storage variable, replacing an instruction for obtaining the address of the target thread local storage variable with the thread local memory address corresponding to the target thread local storage variable, where the intermediate representation corresponding to the code includes the instruction for obtaining the address of the target thread local storage variable.
The code compiling method provided by at least one embodiment of the present disclosure further includes obtaining a record of a thread local memory size required by all target thread local storage variables used by each first function in the code, so as to pre-allocate a fixed thread local memory space for each first function according to the record.
In the code compiling method provided in at least one embodiment of the present disclosure, a life cycle of a pre-allocated thread local memory space is consistent with an execution cycle of a corresponding first function.
At least one embodiment of the disclosure provides a code compiling device, where the code comprises at least one thread local storage variable. The code compiling device comprises an allocation module, an update module, and a generation module. The allocation module is configured to, in response to the existence of a target thread local storage variable, allocate the at least one target thread local storage variable to a thread local memory based on the attributes of the at least one target thread local storage variable to obtain the thread local memory address corresponding to each target thread local storage variable, where a target thread local storage variable is a thread local storage variable, among the at least one thread local storage variable, that is used by a first function. The update module is configured to update an intermediate representation corresponding to the code based on the thread local memory address corresponding to each target thread local storage variable, and the generation module is configured to generate machine code corresponding to the code based on the updated intermediate representation.
At least one embodiment of the present disclosure provides an electronic device comprising at least one processor, at least one memory including one or more computer program modules, wherein the one or more computer program modules are stored in the at least one memory and configured to be executed by the at least one processor, the one or more computer program modules configured to implement the code compilation method provided by at least one embodiment of the present disclosure.
At least one embodiment of the present disclosure provides a non-transitory computer-readable storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by at least one processor, perform the code compilation method provided by at least one embodiment of the present disclosure.
The code compiling method, code compiling device, electronic device, and non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure provide support for a thread local storage model under a parallel computing architecture and solve the problem that traditional thread local storage implementations cannot be directly applied to GPU architectures. Developers can program directly using high-level syntax without manually managing thread data; the programming model is simple and easy to use, which greatly improves development efficiency. Moreover, through compile-time optimization, the corresponding thread local memory address can be fixed directly at compile time: the thread local memory address corresponding to each thread local storage variable does not need to be loaded from a register or dynamically obtained at run time, achieving better performance in obtaining thread local memory addresses.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1 is a flow chart of a method of compiling code according to at least one embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of compiling code according to at least one embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of compiling code according to at least one embodiment of the present disclosure;
FIG. 4A is an exemplary diagram of a code compilation method provided by at least one embodiment of the present disclosure;
FIG. 4B is an exemplary diagram of a code compilation method provided by at least one embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of compiling code according to at least one embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of a code compiling apparatus according to at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of an electronic device provided in accordance with at least one embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of another electronic device provided in accordance with at least one embodiment of the present disclosure; and
FIG. 9 is a schematic block diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments that can be obtained by one of ordinary skill in the art without inventive effort fall within the scope of the present disclosure.
A flowchart is used in this disclosure to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
Unless defined otherwise, technical or scientific terms used in this disclosure have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but are used to distinguish one element from another. Words such as "comprising" or "comprises" mean that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connected" or "coupled" are not limited to physical or mechanical connections but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may change when the absolute position of the described object changes.
The present disclosure is illustrated by the following several specific examples. Detailed descriptions of known functions and known parts (elements) may be omitted for the sake of clarity and conciseness in the following description of the embodiments of the present disclosure. When any part (element) of an embodiment of the present disclosure appears in more than one drawing, the part (element) is denoted by the same or similar reference numeral in each drawing.
LLVM is an open-source compiler framework designed as a collection of modular, reusable compiler and tool chain technologies. LLVM includes a front end, a middle end, and a back end.
The front end of LLVM is responsible for converting the source code of different programming languages or programming models into a generic intermediate representation (Intermediate Representation, IR). For example, Clang is the LLVM compiler front end that natively supports C, C++, and Objective-C. When a user declares a variable using thread_local (C++11) or __thread (a GCC extension), Clang may mark the variable as a thread local storage variable (TLS variable); for example, the variable may be marked in the generated intermediate representation by a thread_local attribute and a particular linkage type (e.g., dso_local).
The middle end of the LLVM is responsible for optimizing the intermediate representation generated by the front end to improve the performance of the final generated code. For example, the thread local storage access pattern may be simplified, combining multiple accesses into one address calculation. The thread-local storage model may also be implicitly selected, for example, based on the scope of the variable, or explicitly specified in the code.
The back end of the LLVM is responsible for converting the optimized intermediate representation into machine code or assembly code according to the characteristics of the target hardware platform. Specific tasks of the backend may include aspects of instruction selection, register allocation, instruction scheduling, code generation, and the like. For example, instruction selection refers to mapping an intermediate representation to an instruction set of a target hardware platform, register allocation refers to mapping virtual registers in the intermediate representation to actual registers of the target hardware, instruction scheduling refers to rearranging instruction order to maximize utilization of hardware resources, and code generation refers to generating machine code or assembly code corresponding to the target hardware platform.
LLVM supports a thread local storage model in which thread local storage variables can be defined through the thread_local keyword, so that each thread has an independent copy of each variable and data races among threads are avoided. That is, thread local storage variables are not shared among threads, meaning that even if multiple threads access the same variable name, they actually operate on separate memory regions.
However, the inventors of the present disclosure have noted that the implementation of thread local storage depends on the support of the target platform. In conventional central processing unit (Central Processing Unit, CPU) architectures (e.g., x86 and ARM), the operating system and the application binary interface (Application Binary Interface, ABI) provide support for thread local storage, for example through specific sections in Executable and Linkable Format (ELF) files (e.g., a .tdata section for storing initialized thread local storage variables and a .tbss section for storing uninitialized thread local storage variables). However, no similar mechanism is available in graphics processing units (Graphics Processing Unit, GPU), general-purpose graphics processing units (General-Purpose Graphics Processing Unit, GPGPU), or other parallel-computing back ends (e.g., CUDA and OpenCL) to support thread local storage, which presents challenges for application scenarios requiring massively parallel processing, such as high-performance computing and machine learning.
Parallel thread models of GPUs (e.g., CUDA and OpenCL) typically require each thread to independently access its private data, but traditional thread local storage implementations (e.g., the CPU's __thread keyword) cannot be mapped directly onto a GPU architecture, making it difficult to meet the requirements of creating thread-private data and supporting the related data management and optimization. Moreover, the lack of hardware-level thread local storage support on GPUs forces developers to manually manage thread-local data (e.g., computing offsets from thread indexes), which increases development complexity and is error-prone. In existing GPU programming practice, developers need to explicitly use thread indexes (e.g., threadIdx.x in CUDA) and manually allocate global memory to emulate thread-private data, which leads to code redundancy and performance penalties. In addition, existing LLVM compilers have insufficient support for thread local storage on GPUs and cannot automatically generate efficient memory allocation and access code.
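For illustration only (a sketch under the assumption of a caller-provided global buffer; the names are hypothetical), the manual emulation described above typically looks like this in CUDA:

__global__ void use_private_data(float* global_buf, int slots_per_thread) {
    // Manual emulation of thread-private data: each thread carves its own
    // slice out of a shared global-memory buffer via its thread index.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float* my_slot = global_buf + tid * slots_per_thread; // hand-computed offset
    my_slot[0] = 2.0f; // logically "thread private", but only by convention
}

Every thread-private variable requires this offset bookkeeping to be repeated by hand, which is the code redundancy and error-proneness the passage refers to.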
The inventors of the present disclosure also note that the thread local storage models supported by current mainstream CPU architectures have performance limitations, such as the instruction and register overhead when the static model accesses a thread local storage variable, and the runtime resolution overhead when the dynamic model accesses a thread local storage variable, as follows:
(1) Static model
In a CPU architecture, the static model may access thread local storage variables by using a segment register (e.g., the FS segment register on Linux or the GS segment register on Windows) plus an offset. One example is the following assembly code:
movq %fs:0, %rax            ; LINE 1
addq $TLS_var@tpoff, %rax   ; LINE 2
Code LINE 1 loads the TLS base address from the address pointed to by the FS segment register into the RAX register, and code LINE 2 adds the offset calculated at compile time (@tpoff) to the base address, so that the actual memory address of the TLS_var variable of the current thread is stored in register RAX. At static link time, the final offset (e.g., 0x10) of the TLS_var variable is determined by the linker, and @tpoff is replaced with the actual value of the offset. With this method, a movq must be executed to load the base address every time a thread local storage variable is accessed, and the RAX register is occupied to temporarily hold the base address, which increases register pressure; there is therefore a considerable instruction cost and register cost.
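As a hedged illustration (assuming x86-64 Linux, the local-exec TLS model, and a compiler that materializes the variable's address), C source along the following lines produces the base-plus-offset address computation shown above:

__thread int TLS_var; // thread local storage variable (GCC extension)

int* tls_var_address(void) {
    // Taking the address forces "TLS base (from %fs:0) + TLS_var@tpoff"
    // to be computed in a register, as in LINE 1 and LINE 2 above.
    return &TLS_var;
}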
(2) Dynamic model
In the CPU architecture, the dynamic model obtains the address of a thread local storage variable by calling __tls_get_addr() at runtime or by dynamically calculating the offset, which causes function-call overhead or memory-access overhead and therefore has low performance.
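For instance (a sketch; the exact code generated depends on the platform and the selected TLS model), accessing a thread local storage variable defined in a shared library typically goes through __tls_get_addr():

// Compiled into a shared library (e.g., with -fPIC), the general-dynamic
// TLS model resolves the variable's address at run time:
extern __thread int lib_tls_var; // defined in another module

int read_lib_tls(void) {
    // Typically lowered to a call to __tls_get_addr() with a tls_index
    // argument, followed by a load through the returned pointer.
    return lib_tls_var;
}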
The embodiments of the present disclosure provide a code compiling method for compiling code that includes at least one thread local storage variable. The method comprises: in response to the existence of a target thread local storage variable, allocating the at least one target thread local storage variable to a thread local memory based on the attributes of the at least one target thread local storage variable to obtain a thread local memory address corresponding to each target thread local storage variable, where a target thread local storage variable is a thread local storage variable, among the at least one thread local storage variable, that is used by a first function; updating an intermediate representation corresponding to the code based on the thread local memory address corresponding to each target thread local storage variable; and generating machine code corresponding to the code based on the updated intermediate representation.
The code compiling method provided by at least one embodiment of the present disclosure may be performed by a compiler. For example, the compiler may be located at the host side, and the code compiling method may be executed by the host side.
The code compiling method provided by at least one embodiment of the present disclosure provides support for a thread local storage model under a parallel computing architecture, solving the problem that traditional thread local storage implementations cannot be directly applied to GPU architectures. With this code compiling method, a developer can program directly using high-level syntax (such as __thread) without manually managing thread data; the programming model is simple and easy to use, which greatly improves development efficiency. In addition, through compile-time optimization, the code compiling method fixes the corresponding thread local memory address directly at compile time; the thread local memory address corresponding to each thread local storage variable does not need to be loaded from a register or dynamically obtained at run time, achieving better performance in obtaining thread local memory addresses.
FIG. 1 is a flow chart of a method of compiling code for compiling code that includes at least one thread local storage variable according to at least one embodiment of the present disclosure. For example, as shown in fig. 1, the code compiling method provided in at least one embodiment of the present disclosure includes the following steps S101 to S103.
Step S101, in response to the existence of a target thread local storage variable, allocating the at least one target thread local storage variable to the thread local memory based on the attributes of the at least one target thread local storage variable to obtain the thread local memory address corresponding to each target thread local storage variable.
For example, in step S101, a target thread local storage variable refers to a thread local storage variable, among the at least one thread local storage variable, that is used by a first function. That is, the "at least one thread local storage variable" may include thread local storage variables used by a first function as well as thread local storage variables not used by any first function. Step S101 allocates thread local memory space only for the thread local storage variables that are used; unused thread local storage variables are not allocated into the thread local memory.
In the code compiling method provided in at least one embodiment of the present disclosure, the code may include a plurality of first functions and a plurality of second functions. The first function is a function called by the host side and executed by the device side, and the second function is a function called by the device side and executed by the device side.
For example, in at least one embodiment of the present disclosure, the host side may include a Central Processing Unit (CPU), the device side may include a Graphics Processor (GPU), a General Purpose Graphics Processor (GPGPU), and the like.
For example, in at least one embodiment of the present disclosure, the first function may be a Kernel function (Kernel function) and the second function may be a Device function (Device function). A kernel function refers to a function called by a host side and executed by a device side, and a device function refers to a function called by a device side and executed by a device side. For example, a kernel function may be called by a host function (e.g., host function) running in the CPU, while a device function may be called by a kernel function or other device function executing at the GPU. Code typically includes a plurality of kernel functions and device functions, each of which may call a plurality of device functions, and each of which may also be called by a plurality of kernel functions.
"Thread-local storage variables used by the first function" refers to thread-local storage variables directly used by the first function, and thread-local storage variables used by the first function by calling the second function, which are all target thread-local storage variables.
The thread local memory (Thread Local Memory, TLM) refers to a memory area private to each thread, mainly used for storing the thread's private data, which other threads cannot directly access. For example, the thread local memory may be thread local memory in a GPU; by keeping target thread local storage variables in the thread local memory, the burden on registers in the GPU can be reduced. For example, the thread local memory may also be thread local memory in other parallel computing architectures, which is not limited by the embodiments of the present disclosure.
It should be noted that the thread local memory and the thread local storage described above are two different concepts, and the thread local memory refers to a memory area, and the thread local storage refers to a variable storage manner.
For example, in step S101, the target thread local storage variable may be allocated to the thread local memory based on the attribute of the target thread local storage variable, so that the thread local memory address corresponding to each target thread local storage variable may be obtained. The attributes of the target thread local storage variables may include, for example, variable type, variable size, alignment requirements, etc., which are not limiting embodiments of the present disclosure.
Step S102, updating the intermediate representation corresponding to the code based on the thread local memory address corresponding to each target thread local memory variable.
For example, the Clang front end of LLVM may automatically generate the intermediate representation corresponding to the code. In LLVM, the TLSSupported member variable of the TargetInfo class controls whether the thread local storage function is enabled; setting TLSSupported to true enables thread local storage support in the LLVM front end.
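As a hedged sketch (the target class name here is hypothetical and heavily simplified; a usable clang::TargetInfo subclass must also override several pure virtual methods, and header paths vary across LLVM versions), a custom target can opt in roughly as follows:

#include "clang/Basic/TargetInfo.h"

// Hypothetical GPU target opting into TLS support in the Clang front end.
class MyGPUTargetInfo : public clang::TargetInfo {
public:
  explicit MyGPUTargetInfo(const llvm::Triple &Triple)
      : clang::TargetInfo(Triple) {
    TLSSupported = true; // allow thread_local / __thread declarations
  }
  // Required pure virtual overrides omitted; the class stays abstract here.
};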
For example, assume that the user has written the following code:
__device__ __thread int a = 1;  //LINE1
__device__ __thread float c[2]; // assumed declaration of c, matching the [2 x float] IR below
__global__ void kernel() {
    c[1] = 2;                   //LINE3
}
Code LINE1 above is a declarative definition of the thread local storage variable a. In some examples, the intermediate representation automatically generated by the Clang front end of LLVM from code LINE1 is as follows:
@a = thread_local addrspace(1) externally_initialized global i32 1, align 4
In the above intermediate representation, @a indicates that the variable a is defined at global scope, thread_local indicates that a is a thread local storage variable, addrspace(1) specifies that the address space of the variable is global, externally_initialized indicates that initialization of the variable is done externally (e.g., at run time, not during compilation), global i32 1 defines a global variable of 32-bit integer type with an initial value of 1, and align 4 indicates that the variable is 4-byte aligned.
Code LINE3 implements the assignment to element c[1] of the thread local storage variable c. In some examples, the intermediate representation automatically generated by the Clang front end of LLVM from code LINE3 is as follows:
%1 = tail call ptr @llvm.threadlocal.address.p0(ptr addrspacecast (ptr addrspace(1) @c to ptr))
%arrayidx1 = getelementptr inbounds [2 x float], ptr %1, i64 0, i64 1
store float 2.000000e+00, ptr %arrayidx1, align 4
In the above intermediate representation, the llvm.threadlocal.address.p0 instruction is used to obtain the address of the thread local storage variable c, %1 holds the address of the thread local storage variable c, %arrayidx1 holds the address of c[1], and the store instruction stores the floating-point number 2.0 into the storage location pointed to by %arrayidx1, i.e., the storage location corresponding to c[1].
In step S102, the intermediate representation automatically generated by the Clang front end may be updated based on the variable allocation results of step S101 to obtain an optimized intermediate representation.
In the code compiling method provided in at least one embodiment of the present disclosure, the intermediate representation corresponding to the code includes an instruction for obtaining the address of a target thread local storage variable, and one example of step S102 is that, for each target thread local storage variable, the instruction for obtaining the address of the target thread local storage variable is replaced with the thread local memory address corresponding to the target thread local storage variable.
For example, the instruction for obtaining the address of the target thread local storage variable is llvm.threadlocal.address.p0; by traversing the functions that use each target thread local storage variable, all llvm.threadlocal.address.p0 instructions can be directly replaced with the thread local memory address corresponding to the target thread local storage variable to obtain the optimized intermediate representation.
In some examples, the intermediate representation before optimization (i.e., the intermediate representation automatically generated by the Clang front-end) is as follows:
%0 = tail call ptr @llvm.threadlocal.address.p0(ptr addrspacecast (ptr addrspace(1) @th to ptr))
store i32 66, ptr %0, align 4
The corresponding optimized intermediate representation is as follows:
store i32 66, ptr addrspacecast (ptr addrspace(4) inttoptr (i32 24 to ptr addrspace(4)) to ptr), align 4
In the optimized intermediate representation, the llvm.threadlocal.address.p0 instruction is no longer needed; 24 is the thread local memory address of the target thread local storage variable th obtained by the allocation in step S101, addrspace(4) designates the address space of the variable as local, and addrspacecast represents an address space conversion, so that a correct thread local memory read/write instruction is generated that meets the requirements of the target architecture.
Therefore, the optimized intermediate representation fixes the corresponding thread local memory address at compile time: the thread local memory address corresponding to each target thread local storage variable does not need to be loaded from a register, nor dynamically obtained at run time, which achieves better performance in obtaining thread local memory addresses.
Step S103, generating machine code corresponding to the code based on the updated intermediate representation.
For example, in step S103, the optimized intermediate representation obtained in step S102 may be converted into machine code suitable for the target hardware platform by steps such as instruction selection, register allocation, instruction scheduling, code generation, etc. as described above. The optimization of the intermediate representation in step S102 helps to reduce the effort of instruction selection, resulting in improved performance.
Specifically, in a CPU architecture, instruction selection for thread local storage variable access instructions (e.g., ISD::GlobalTLSAddress) is performed at the target lowering (TargetLowering) stage. The target lowering stage is primarily responsible for converting the intermediate representation into a lower-level representation related to the target hardware platform. In the process of converting the intermediate representation into machine code, access to a TLS variable is realized by adding an offset to a segment register, and the corresponding read/write operation is completed. In the code compiling method provided by at least one embodiment of the present disclosure, before instruction selection, the instruction for obtaining the address of the target thread local storage variable has already been replaced with the thread local memory address corresponding to the target thread local storage variable, so that the workload of instruction selection is effectively reduced.
The code compiling method provided in at least one embodiment of the present disclosure further includes the following step S104. Step S104 may be performed before step S101.
Step S104, for each target thread local storage variable, determining a first function using the target thread local storage variable to construct a variable-function mapping relation.
In step S104, the path of a variable through the functions may be tracked according to a call graph (also called a call relationship graph) to determine which functions use each target thread local storage variable. The call graph is a directed graph that represents the call relationships among the functions in a program. The compiler may generate the call graph by means of, for example, static analysis; for instance, call relationships between functions may be analyzed by a call graph builder integrated in the compiler, which then generates the call graph. The variable-function mapping relationship records which first functions use each target thread local storage variable, and may be represented in the form of a map.
It should be noted that, since some thread-local storage variables in the code may be defined only and not used by any first function, not every thread-local storage variable can find the corresponding first function in the variable-function mapping relationship.
In step S104, "using the first function of the target thread local storage variable" includes directly or indirectly using the first function of the target thread local storage variable. For example, "directly use" refers to a first function directly using the target thread local storage variable, and "indirectly use" refers to a first function using the target thread local storage variable by calling a second function.
Correspondingly, one example of step S104 may be to determine, for each target thread local storage variable, a kernel function that directly or indirectly uses the target thread local storage variable to construct a variable-function mapping relationship.
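As an illustration (a minimal sketch with assumed data structures; a real implementation would walk the LLVM IR and the call graph described above), the construction of the variable-function mapping could look like this:

#include <map>
#include <set>
#include <string>
#include <vector>

// variable name -> set of kernel ("first") functions that use it,
// directly or through called device ("second") functions.
using VarFuncMap = std::map<std::string, std::set<std::string>>;

struct CallInfo {
    // kernel -> device functions it calls, transitively
    // (assumed precomputed from the call graph)
    std::map<std::string, std::vector<std::string>> callees;
    // function -> thread local storage variables it uses directly
    std::map<std::string, std::vector<std::string>> directUses;
};

VarFuncMap buildVariableFunctionMap(const CallInfo& info,
                                    const std::vector<std::string>& kernels) {
    VarFuncMap map;
    for (const std::string& kernel : kernels) {
        // direct uses by the kernel itself
        auto direct = info.directUses.find(kernel);
        if (direct != info.directUses.end())
            for (const std::string& var : direct->second)
                map[var].insert(kernel);
        // indirect uses through called device functions
        auto calls = info.callees.find(kernel);
        if (calls == info.callees.end()) continue;
        for (const std::string& callee : calls->second) {
            auto uses = info.directUses.find(callee);
            if (uses == info.directUses.end()) continue;
            for (const std::string& var : uses->second)
                map[var].insert(kernel);
        }
    }
    return map;
}

Variables that appear in no kernel's entry are exactly the defined-but-unused thread local storage variables noted above, and they receive no thread local memory.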
Fig. 2 is a flowchart of a code compiling method according to at least one embodiment of the present disclosure.
As shown in fig. 2, in the code compiling method according to at least one embodiment of the present disclosure, an example of step S101 may include the following steps S201 to S202.
Step S201, determining a variable allocation order based on the attributes of the at least one target thread local storage variable.
For example, in step S201, the plurality of target thread local storage variables may be ordered according to the attributes of the target thread local storage variables.
Step S202, sequentially distributing at least one target thread local storage variable to the thread local memory according to the variable distribution sequence.
For example, in step S202, according to the variable allocation order determined in step S201, a plurality of target thread local storage variables may be sequentially allocated to corresponding positions in the thread local memory.
In the code compiling method provided in at least one embodiment of the present disclosure, one example of step S101 may further include the following step S203.
Step S203, regarding the target thread local storage variable corresponding to the plurality of first functions, taking the maximum value of the allocated offsets corresponding to the plurality of first functions as an allocation starting point of the target thread local storage variable in the thread local memory.
For example, in step S203, a target thread local storage variable corresponding to a plurality of first functions means that the target thread local storage variable is directly or indirectly used by the plurality of first functions. In some examples, direct use means that a kernel function uses the target thread local storage variable itself, and indirect use means that a kernel function uses the target thread local storage variable by calling a device function.
For example, assume that the thread local storage variable b is used only by Kernel functions Kernel 1 and Kernel 2 (thread local storage variable b is the target thread local storage variable), and that the allocated offset corresponding to Kernel function Kernel 1 is 4 bytes and the allocated offset corresponding to Kernel function Kernel 2 is 0 bytes. It is known that the maximum value of the allocated offset is 4 bytes, and the maximum value is taken as the allocation starting point of the thread local storage variable b in the thread local memory, namely, the allocation starting address of the thread local storage variable b in the thread local memory is 4 instead of 0.
It should be noted that, in the code compiling method provided in at least one embodiment of the present disclosure, a separate section of thread local memory space is allocated for each kernel function, where "the allocation start point in the thread local memory" refers to an offset of a start address relative to the thread local memory space of each section. For example, kernel function Kernel 1 corresponds to a segment of thread local memory space T1, and Kernel function Kernel 2 corresponds to a segment of thread local memory space T2. To simplify the program logic, for each kernel function, the starting address of the corresponding thread local memory space is considered to be 0, that is, the starting addresses of T1 and T2 are considered to be 0, and then can be mapped to different physical addresses through hardware processing. Thus, in the example above, the storage locations of thread local storage variable b in both thread local memory spaces T1 and T2 are offset back by 4 bytes from start address 0.
Since multiple kernel functions and multiple device functions may use the same thread local storage variable, and a device function cannot determine which kernel functions will call it, the address allocated to the same thread local storage variable must be identical in all kernel functions and device functions; that is, any kernel function or device function can find the thread local storage variable in the thread local memory through the same address. The approach provided in step S203 achieves this required consistency of address allocation for the same thread local storage variable, i.e., the kernel functions and device functions can all find the thread local storage variable b in the thread local memory through address 4.
Note that for a target thread local storage variable used by only a single first function, the allocation starting point in the thread local memory does not need to be determined according to step S203; such a variable only needs to be allocated sequentially.
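A minimal sketch of the step S203 rule, under assumed bookkeeping structures (not the patent's actual implementation):

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Bytes already allocated in each kernel's thread local memory space.
using AllocatedOffsets = std::map<std::string, uint64_t>;

// A variable shared by several kernels must get one common start offset,
// so take the maximum of the offsets already allocated in those kernels.
uint64_t allocationStart(const AllocatedOffsets& allocated,
                         const std::vector<std::string>& usingKernels) {
    uint64_t start = 0;
    for (const std::string& kernel : usingKernels)
        start = std::max(start, allocated.at(kernel));
    return start;
}

// Example from the text: Kernel 1 has already allocated 4 bytes and
// Kernel 2 has allocated 0, so variable b starts at offset 4 in both spaces.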
In the code compiling method provided in at least one embodiment of the present disclosure, the attribute of the target thread local storage variable may include at least one of a first function number corresponding to the target thread local storage variable, an alignment parameter of the target thread local storage variable, a size of the target thread local storage variable, and a defined order of the target thread local storage variable.
For example, the number of first functions corresponding to the target thread local storage variable may be understood as the number of first functions using the target thread local storage variable. For example, assuming that n kernel functions directly or indirectly use the target thread local storage variable, the first function number corresponding to the target thread local storage variable is n.
For example, the alignment parameter (alignment) of a target thread local storage variable is a parameter that explicitly specifies the alignment of the variable in storage, typically in bytes, e.g., 4 bytes, 8 bytes, 16 bytes, etc. For example, when the alignment parameter of a target thread local storage variable is 4 bytes, the target thread local storage variable needs to be 4-byte aligned when stored.
For example, the size (size) of the target thread local storage variable may be understood as the amount of storage space the target thread local storage variable needs to occupy, depending on the type of variable, typically in bytes. For example, the size of the target thread local storage variable of the int type is 4 bytes. It should be noted that the size of the local storage variable of the target thread of the same type may be different in different systems or compilers.
For example, the defined order of the target thread local storage variables refers to the order defined by each target thread local storage variable in the code, as determined by the user when writing the code.
For example, one or more attributes of the first function number corresponding to the target thread local storage variable, the alignment parameter of the target thread local storage variable, the size of the target thread local storage variable, and the defined order of the target thread local storage variable may be considered in determining the variable allocation order. For example, in some examples, the variable allocation order may be determined based solely on the first function number corresponding to the at least one target thread local storage variable, may be determined based solely on an alignment parameter of the at least one target thread local storage variable, may be determined based solely on a size of the at least one target thread local storage variable, or may be determined based solely on a defined order of the at least one target thread local storage variable. For example, in other examples, the variable allocation order may be determined based on a first function number corresponding to the at least one target thread local storage variable, an alignment parameter of the at least one target thread local storage variable, a size of the at least one target thread local storage variable, and a defined order of the at least one target thread local storage variable. For example, two or three of the above four attributes may be selected as the basis for determining the variable assignment order.
It should be noted that the above three attributes are only examples, and the variable allocation order may be determined according to actual needs and other attributes, which is not limited by the embodiments of the present disclosure.
In the code compiling method provided in at least one embodiment of the present disclosure, an example of the step S201 may include any one of the following steps S301 to S304. For example, if only one attribute is considered in determining the variable allocation order, a corresponding step may be selected from steps S301 to S304 according to the attribute to be executed.
Step S301, arranging at least one target thread local storage variable in descending order according to the first function quantity corresponding to the at least one target thread local storage variable.
Step S302, arranging at least one target thread local storage variable in descending order according to the alignment parameter of the at least one target thread local storage variable.
Step S303, arranging at least one target thread local storage variable according to the size ascending order of the at least one target thread local storage variable.
Step S304 is to arrange the at least one target thread local storage variable according to the defined sequence of the at least one target thread local storage variable.
For example, in step S301, the target thread local storage variables may be ordered from large to small according to the number of first functions corresponding to each target thread local storage variable. For example, assuming that there are three target thread local storage variables a, b, and c to be ordered, where the number of first functions corresponding to a is 4, the number corresponding to b is 5, and the number corresponding to c is 2, the ordering result is b, a, c. In some examples, if there are multiple target thread local storage variables with the same number of corresponding first functions, they may be further ordered in the definition order of the variables.
For example, in step S302, the target thread local storage variables may be ordered from large to small according to their alignment parameters. For example, assuming that there are three target thread local storage variables a, b, and c to be ordered, where the alignment parameter of a is 4 bytes, that of b is 8 bytes, and that of c is 16 bytes, the ordering result is c, b, a. In some examples, if there are multiple target thread local storage variables with the same alignment parameter, they may be further ordered in the definition order of the variables.
For example, in step S303, the target thread local storage variables may be ordered from small to large according to their sizes. For example, assuming that there are three target thread local storage variables a, b, and c to be ordered, where the size of a is 8 bytes, the size of b is 1 byte, and the size of c is 4 bytes, the ordering result is b, c, a. In some examples, if there are multiple target thread local storage variables of the same size, they may be further ordered in the definition order of the variables.
For example, in step S304, the target thread local storage variables may be ordered in the defined order of the target thread local storage variables. For example, assuming that there are three target threads to be ordered that store variables a, b, c locally and that define an order of a, b, c, the ordering result is a, b, c.
The purpose of the above sorting methods is to make the allocated thread local memory space as small as possible and improve thread local memory utilization. The sorting may also be performed in other ways as needed, and the embodiments of the present disclosure are not limited in this regard.
Fig. 3 is a flowchart of a code compiling method according to at least one embodiment of the present disclosure.
As shown in fig. 3, in the code compiling method provided in at least one embodiment of the present disclosure, if all four of the above attributes are considered in determining the variable allocation order, one example of the above step S201 may include the following steps S401 to S404.
Step S401, arranging at least one target thread local storage variable in descending order according to the first function quantity corresponding to the at least one target thread local storage variable.
Step S402, in response to the fact that a plurality of first thread local storage variables with the same corresponding first function number exist in at least one target thread local storage variable, the plurality of first thread local storage variables are arranged in descending order according to the alignment parameters of the target thread local storage variables.
Step S403, in response to the fact that the plurality of second thread local storage variables with the same alignment parameters exist in the plurality of first thread local storage variables, the plurality of second thread local storage variables are arranged in ascending order according to the size of the target thread local storage variable.
Step S404, in response to the existence of a plurality of third thread local storage variables with the same size in the plurality of second thread local storage variables, arranging the plurality of third thread local storage variables according to the defined sequence of the target thread local storage variables.
For example, in step S401, the target thread local storage variables may be ordered from large to small according to the number of first functions corresponding to the target thread local storage variables.
For example, in step S402, if it is determined that there are a plurality of first thread local storage variables having the same number of first functions in the target thread local storage variables according to the sorting result of step S401, the plurality of first thread local storage variables are sorted from large to small according to the alignment parameter of the target thread local storage variables. It should be noted that, in step S401, only the variables whose order has not been determined (i.e., the first thread local storage variables) are processed in step S402, and the variables whose order has been determined are not considered.
For example, in step S403, if it is determined that a plurality of second thread local storage variables having the same alignment parameter exist among the plurality of first thread local storage variables according to the sorting result of step S402, the plurality of second thread local storage variables are sorted from small to large according to the size of the target thread local storage variable. It should be noted that, step S403 only processes the variables whose order has not been determined in step S402 (i.e., the second thread local storage variables), and the variables whose order has been determined are not considered.
For example, in step S404, if it is determined that a plurality of third thread local storage variables having the same size exist in the plurality of second thread local storage variables according to the sorting result in step S403, the plurality of third thread local storage variables are arranged in the defined order of the target thread local storage variables, thereby obtaining a final sorting result. Defining an order herein refers to defining an order of variables when a user writes code. It should be noted that, in step S403, only the variables whose order has not been determined (i.e., the third thread local storage variables) are processed in step S404, and the variables whose order has been determined are not considered.
It should be noted that the above steps S401 to S404 are just one example, which uses the attribute priority: 1. the number of first functions corresponding to the target thread local storage variable; 2. the alignment parameter of the target thread local storage variable; 3. the size of the target thread local storage variable; 4. the definition order of the target thread local storage variable. This attribute priority is set to keep gaps in the allocated thread local memory as small as possible and improve thread local memory utilization. According to actual needs, sorting may also be performed with other attribute priorities, for example first by size and then by alignment parameter; the embodiments of the present disclosure are not limited in this respect.
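As a compact sketch of the S401 to S404 priority (the struct and its fields are illustrative, and attributes the text leaves unstated for some variables are filled in arbitrarily, since they do not affect the result), the ordering can be expressed as a comparator and checked against the six-variable example that follows:

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TLSVar {
    char name;          // variable name, for the example below
    int kernelCount;    // number of first functions using the variable
    uint32_t alignment; // alignment parameter, in bytes
    uint64_t size;      // variable size, in bytes
    int defOrder;       // definition order in the source code
};

// Steps S401-S404: descending kernel count, then descending alignment,
// then ascending size, then definition order.
bool allocBefore(const TLSVar& x, const TLSVar& y) {
    if (x.kernelCount != y.kernelCount) return x.kernelCount > y.kernelCount;
    if (x.alignment != y.alignment) return x.alignment > y.alignment;
    if (x.size != y.size) return x.size < y.size;
    return x.defOrder < y.defOrder;
}

int main() {
    // The example below: a used by 4 kernels, b by 5, c/d/e/f by 2;
    // c, d, e aligned to 4 bytes, f to 8; c is 2 bytes, d and e are 1 byte.
    std::vector<TLSVar> vars = {
        {'a', 4, 4, 4, 0}, {'b', 5, 4, 4, 1}, {'c', 2, 4, 2, 2},
        {'d', 2, 4, 1, 3}, {'e', 2, 4, 1, 4}, {'f', 2, 8, 4, 5},
    };
    std::stable_sort(vars.begin(), vars.end(), allocBefore);
    const char expected[] = {'b', 'a', 'f', 'd', 'e', 'c'};
    for (std::size_t i = 0; i < vars.size(); ++i)
        assert(vars[i].name == expected[i]); // final order: b, a, f, d, e, c
    return 0;
}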
The following is a specific example of the above steps S401 to S404.
For example, assume that there are six thread local storage variables a, b, c, d, e, f to be ordered (all of them target thread local storage variables), where the number of first functions corresponding to thread local storage variable a is 4, the number corresponding to thread local storage variable b is 5, and the number corresponding to each of thread local storage variables c, d, e, f is 2. After step S401 is performed, the sorting result is b, a, (c, d, e, f), where (c, d, e, f) means that the relative order among variables c, d, e, f has not yet been determined; it can only be confirmed that these four variables are all located after variable a.
Since the number of first functions is the same for variables c, d, e, f (i.e., c, d, e, f are all first thread local storage variables), step S402 needs to be performed to determine the relative order among the four variables based on the alignment parameters. For example, further assume that the alignment parameters of thread local storage variables c, d, e are 4 bytes and the alignment parameter of thread local storage variable f is 8 bytes. After step S402 is performed, the further sorting result is b, a, f, (c, d, e), where (c, d, e) means that the relative order among variables c, d, e has not yet been determined; it can only be confirmed that these three variables are all located after variable f.
Since the alignment parameters of variables c, d, e are the same (i.e., c, d, e are all second thread local storage variables), step S403 needs to be performed to determine the relative order among the three variables according to size. For example, further assume that the size of thread local storage variable c is 2 bytes and the sizes of thread local storage variables d and e are 1 byte each. After step S403 is performed, the further sorting result is b, a, f, (d, e), c, where (d, e) means that the relative order between variables d and e has not yet been determined; it can only be confirmed that these two variables are located after variable f and before variable c.
Since variables d and e have the same size (i.e., d and e are both third thread local storage variables), step S404 needs to be performed to determine the relative order between the two variables according to the definition order. For example, further assume that thread local storage variable d is defined before thread local storage variable e. After step S404 is performed, the sorting result is b, a, f, d, e, c, which is the final sorting result.
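Continuing the sketch above, this six-variable example can be checked directly; the values the text leaves unspecified (the alignments and sizes of a and b, and the size of f) are filled in arbitrarily, since they do not affect the result.

#include <iostream>
// (uses TLSVariable and sortTLSVariables from the sketch above)

int main() {
    std::vector<TLSVariable> vars = {
        // name, first-function count, alignment, size, definition index
        {"a", 4, 4, 4, 0}, {"b", 5, 4, 4, 1}, {"c", 2, 4, 2, 2},
        {"d", 2, 4, 1, 3}, {"e", 2, 4, 1, 4}, {"f", 2, 8, 4, 5},
    };
    sortTLSVariables(vars);
    for (const auto& v : vars) std::cout << v.name << ' '; // prints: b a f d e c
    std::cout << '\n';
    return 0;
}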
Fig. 4A is an exemplary schematic diagram of a code compiling method according to at least one embodiment of the present disclosure.
For example, as shown in FIG. 4A, the code includes three kernel functions and six device functions. The three kernel functions are denoted Kernel1 to Kernel3, and the six device functions are denoted Func1 to Func6. As shown in the call graph on the right of FIG. 4A, kernel function Kernel1 calls device functions Func1 and Func2, kernel function Kernel2 calls device functions Func3 and Func4, and kernel function Kernel3 calls device functions Func5 and Func6. Device function Func1 uses thread local storage variable a, device function Func2 uses thread local storage variable b, device function Func3 uses thread local storage variable b, device function Func4 uses thread local storage variable c, device function Func5 uses thread local storage variable c, and device function Func6 uses thread local storage variable a. The alignment parameters of thread local storage variables a, b, and c are all 4 bytes, and their sizes are also all 4 bytes. Thread local storage variables a, b, and c are all target thread local storage variables.
Steps S401 to S404 are executed to sort thread local storage variables a, b, c. From the above, the number of first functions corresponding to thread local storage variable a is 2 (kernel functions Kernel1 and Kernel3), the number corresponding to thread local storage variable b is 2 (kernel functions Kernel1 and Kernel2), and the number corresponding to thread local storage variable c is 2 (kernel functions Kernel2 and Kernel3). Therefore, thread local storage variables a, b, and c have the same number of first functions, the same alignment parameters, and the same sizes, so the final ordering cannot be obtained from steps S401 to S403 alone. Step S404 is executed to order thread local storage variables a, b, c according to the definition order, and the final sorting result is a, b, c.
First, thread local storage variable a is allocated. Thread local storage variable a is used by kernel functions Kernel1 and Kernel3, and the allocated offsets corresponding to Kernel1 and Kernel3 are both 0 bytes, i.e., the maximum allocated offset is 0 bytes. Therefore, the allocation starting address of thread local storage variable a in the thread local memory is 0, and 4 bytes of thread local memory space are allocated.
Next, thread local storage variable b is allocated. Thread local storage variable b is used by kernel functions Kernel1 and Kernel2; the allocated offset corresponding to kernel function Kernel1 is 4 bytes and the allocated offset corresponding to kernel function Kernel2 is 0 bytes, i.e., the maximum allocated offset is 4 bytes. Therefore, the allocation starting address of thread local storage variable b in the thread local memory is 4, and 4 bytes of thread local memory space are allocated.
Finally, thread local storage variable c is allocated. Thread local storage variable c is used by kernel functions Kernel2 and Kernel3; the allocated offset corresponding to Kernel2 is 8 bytes and the allocated offset corresponding to Kernel3 is 4 bytes, i.e., the maximum allocated offset is 8 bytes. Therefore, the allocation starting address of thread local storage variable c in the thread local memory is 8, and 4 bytes of thread local memory space are allocated.
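As a rough illustration of this per-kernel offset scheme, the following C++ sketch (reusing TLSVariable from above; the map-based bookkeeping is an assumption about one possible implementation) places each variable at the maximum allocated offset among the kernel functions that use it, rounded up to its alignment parameter.

#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>
// (uses TLSVariable from the earlier sketch)

struct Allocation {
    std::string var;
    std::size_t start; // allocation starting address within each kernel's space
};

// sortedVars: target thread local storage variables in the order produced by
// steps S401 to S404; usersOf: variable name -> kernel functions using it.
std::vector<Allocation> allocateTLS(
    const std::vector<TLSVariable>& sortedVars,
    const std::map<std::string, std::vector<std::string>>& usersOf) {
    std::map<std::string, std::size_t> kernelOffset; // kernel -> bytes allocated
    std::vector<Allocation> result;
    for (const auto& v : sortedVars) {
        std::size_t start = 0;
        for (const auto& k : usersOf.at(v.name))      // maximum allocated
            start = std::max(start, kernelOffset[k]); // offset among users
        start = (start + v.alignment - 1) / v.alignment * v.alignment;
        for (const auto& k : usersOf.at(v.name))      // advance each user's
            kernelOffset[k] = start + v.size;         // offset past the variable
        result.push_back({v.name, start});
    }
    return result;
}

With the FIG. 4A inputs this yields starting addresses 0, 4, and 8 for a, b, and c; when no variable is shared between kernel functions, each kernel's variables pack compactly from offset 0 (the FIG. 4B case below).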
Fig. 4B is an exemplary schematic diagram of a code compiling method according to at least one embodiment of the present disclosure.
For example, as shown in FIG. 4B, the code includes two kernel functions and four device functions. The two kernel functions are denoted Kernel1 and Kernel2, and the four device functions are denoted Func1 to Func4. As shown in the call graph on the right of FIG. 4B, kernel function Kernel1 calls device functions Func1 and Func2, and kernel function Kernel2 calls device functions Func3 and Func4. Device function Func1 uses thread local storage variable a, device function Func2 uses thread local storage variable b, device function Func3 uses thread local storage variable c, and device function Func4 uses thread local storage variable d. Thread local storage variables a, b, c, d are all target thread local storage variables. Since no thread local storage variable is shared between kernel functions Kernel1 and Kernel2, the variables can be allocated compactly in order. In this case, there is no gap in the thread local memory and no thread local memory space is wasted.
In the code compiling method according to at least one embodiment of the present disclosure, a separate thread local memory space is allocated for each kernel function, so that the "allocation start address in the thread local memory" refers to an offset of the start address with respect to each thread local memory space. It should be noted that, in fig. 4A and fig. 4B, the stack is used to illustrate allocation of the target thread local storage variable, and the stack herein may be regarded as a virtual concept. In LLVM, when the intermediate representation is converted to machine code, the allocation on the stack is mapped to the actual storage space of the target architecture, i.e., to the corresponding thread local memory space according to the location of the target thread local storage variable on the stack.
The thread local memory may also be used to store local variables other than the target thread local storage variables. For example, as shown in FIG. 4A and FIG. 4B, the stack corresponding to each kernel function may further include local variables.
The compiler can automatically manage and optimize memory allocation. By optimizing the frame layout of the target thread local storage variables, the utilization of the stack can be effectively improved and the total space occupied by the thread local memory can be reduced.
The code compiling method provided by at least one embodiment of the present disclosure may further include the following step S105. Step S105 may be performed before step S102.
Step S105, inserting an instruction for initializing the target thread local storage variable used by each first function into the intermediate representation corresponding to the code.
For example, in step S105, a plurality of first functions and a plurality of second functions may be included in the code. The target thread local storage variable used directly or indirectly by each first function can be determined according to the call graph. For each first function, an instruction may be inserted at the start position of its corresponding intermediate representation to initialize the target thread local storage variable that it uses directly or indirectly.
For example, taking the target thread local storage variable a as an example, one example of an instruction to initialize the target thread local storage variable a is as follows:
%0 = call ptr @llvm.threadlocal.address.p0(ptr addrspacecast (ptr addrspace(1) @a to ptr))
store i32 1, ptr %0, align 4
The meaning of the above intermediate representation is that the address of target thread local storage variable a is first obtained via the llvm.threadlocal.address.p0 intrinsic, and then the initial value of variable a (1 in this example) is stored to that address via the store instruction. It should be noted that the above initialization instruction is only an example, and different initialization instructions may be inserted according to actual needs.
It should be noted that the initialization instruction is updated in step S102, that is, the llvm.threadlocal.address.p0 call is replaced with the thread local memory address corresponding to target thread local storage variable a, so the initialization instruction introduced here does not incur the access overhead of reading and writing the address via a register.
In a CPU architecture, thread local storage variables are typically initialized by the runtime environment at program start-up. Unlike that initialization mode, the code compiling method provided by at least one embodiment of the present disclosure inserts the instructions that initialize the target thread local storage variables at compile time rather than at run time, which effectively avoids refreshing the entire thread local memory at run time and thus optimizes performance. Furthermore, this approach creates more opportunities for compiler optimization, reduces the coupling between compile time and run time, and reduces the workload at both compile time and run time.
For example, in at least one embodiment of the present disclosure, before step S105, the usage of each thread local storage variable may be analyzed for each first function, and for variables that are not used, no corresponding initialization instruction needs to be inserted. For example, the above analysis may be implemented by the compiler. In this way, unnecessary initialization operations can be reduced and performance improved.
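A minimal sketch of such an analysis, under the assumption of a toy call-graph representation (the type and function names are hypothetical): starting from a first function, every thread local storage variable used directly or via called second functions is collected, and initialization instructions are inserted only for variables in the resulting set.

#include <map>
#include <set>
#include <string>
#include <vector>

using CallGraph = std::map<std::string, std::vector<std::string>>; // caller -> callees
using DirectUses = std::map<std::string, std::set<std::string>>;   // function -> TLS vars used directly

// Collect the target thread local storage variables used directly or
// indirectly by first function `fn` via a depth-first walk of the call graph.
std::set<std::string> usedTLSVars(const std::string& fn, const CallGraph& calls,
                                  const DirectUses& uses) {
    std::set<std::string> result, visited;
    std::vector<std::string> work{fn};
    while (!work.empty()) {
        std::string f = work.back();
        work.pop_back();
        if (!visited.insert(f).second) continue; // already visited
        if (auto it = uses.find(f); it != uses.end())
            result.insert(it->second.begin(), it->second.end());
        if (auto it = calls.find(f); it != calls.end())
            for (const auto& callee : it->second) work.push_back(callee);
    }
    return result;
}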
For example, to further optimize performance, redundant instruction optimization may also be performed on the intermediate representation corresponding to the code prior to step S101.
For example, in some examples, for an unused thread local storage variable, the instruction for obtaining its address may be removed from the intermediate representation corresponding to the code.
For example, the intermediate representation automatically generated by the Clang front end of LLVM may be analyzed; that is, the usage of the thread local storage variables is analyzed for each first function, and for unused thread local storage variables, the instructions for obtaining their addresses are removed from the intermediate representation corresponding to the first function.
For example, in other examples, duplicate instructions to fetch the addresses of the thread local storage variables may be removed from the intermediate representation to which the code corresponds such that the addresses are fetched only once for each thread local storage variable within each first function. For example, if there are multiple identical instructions in the intermediate representation corresponding to a certain first function for obtaining the addresses of the thread local storage variables, then the duplicate instructions need to be removed, leaving only the first instruction.
A specific example is given below. Assuming that c is an array-type variable, for a kernel function that uses thread local storage variables c[1] and c[2], the instructions in the intermediate representation for obtaining the addresses of c[1] and c[2] are removed, leaving only the instruction for obtaining the base address of thread local storage variable c; the addresses of c[1] and c[2] are then computed by adding the corresponding offsets to the base address of c.
In this way, repeated address acquisition operations can be avoided, significantly improving running efficiency and reducing the overhead of redundant instructions.
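The deduplication itself is straightforward; the following toy C++ sketch (the Inst representation is an assumption made for illustration, whereas a real pass would operate on LLVM IR) keeps only the first address fetch per base variable within a function.

#include <string>
#include <unordered_set>
#include <vector>

// Toy instruction: `fetchesAddressOf` names the base thread local storage
// variable whose address the instruction obtains (empty for other
// instructions); e.g. accesses to c[1] and c[2] are both tagged with "c".
struct Inst {
    std::string text;
    std::string fetchesAddressOf;
};

std::vector<Inst> dedupAddressFetches(const std::vector<Inst>& body) {
    std::unordered_set<std::string> seen;
    std::vector<Inst> out;
    for (const auto& inst : body) {
        if (!inst.fetchesAddressOf.empty() &&
            !seen.insert(inst.fetchesAddressOf).second)
            continue; // later fetch of an already-obtained base address is dropped
        out.push_back(inst);
    }
    return out;
}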
The code compiling method provided by at least one embodiment of the present disclosure may further include the following step S106. Step S106 may be performed before step S101.
Step S106, obtaining the record of the thread local memory size required by all target thread local storage variables used by each first function in the code, so as to pre-allocate fixed thread local memory space for each first function according to the record.
Frame lowering (Frame Lowering) is a stage of the LLVM backend responsible for converting the abstract stack frame operations in the intermediate representation into machine code suitable for the target hardware platform. Its main task is to generate the prologue and epilogue of a function call, i.e., the stack frame management code at the function entry and exit. These instructions are primarily used to set up and restore stack frames so that functions execute properly and the integrity of the call chain is maintained.
For example, the thread local memory size required by each kernel function may be determined at the compile stage through step S101. In step S106, the thread local memory size corresponding to all the target thread local storage variables used by each kernel function may be recorded as metadata. It should be noted that the thread local memory is also used to store local variables other than the target thread local storage variables; what is recorded here is only the thread local memory size occupied by all target thread local storage variables used by the kernel function, excluding the thread local memory size occupied by other local variables. For example, a compiler typically embeds a piece of metadata when generating the executable corresponding to a first function. Metadata may be regarded as additional information that can be used to guide compiler optimization, debugging, code analysis, or runtime behavior. The metadata recording the thread local memory size required by each kernel function can be read early in backend code generation, and a fixed thread local memory space is pre-allocated for each kernel function accordingly. The lifetime of the pre-allocated thread local memory space coincides with the execution cycle of the corresponding kernel function, that is, it lasts until the corresponding kernel function ends. A device function can also access the target thread local storage variables from the thread local memory space corresponding to its kernel function. Thus, with the above pre-allocation, no additional processing of each function's prologue and epilogue is required to manage the target thread local storage variables during the frame lowering stage, because the space for these variables has been reserved in advance.
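A sketch of this hand-off, under the simplifying assumption that the per-kernel sizes are carried as a name-to-bytes table (the disclosure uses LLVM metadata; the names below are hypothetical):

#include <cstddef>
#include <map>
#include <string>

// Per-kernel record written after step S101: only the bytes taken by target
// thread local storage variables are counted, not other local variables.
using TLSSizeMetadata = std::map<std::string, std::size_t>; // kernel -> bytes

// Early in backend code generation, frame lowering reads the record and
// pre-reserves a fixed region for the kernel; 0 if the kernel uses no
// target thread local storage variable.
std::size_t preallocatedTLSBytes(const TLSSizeMetadata& md,
                                 const std::string& kernel) {
    auto it = md.find(kernel);
    return it == md.end() ? 0 : it->second;
}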
FIG. 5 is a flow chart of a method of compiling code for compiling code that includes at least one thread local storage variable according to at least one embodiment of the present disclosure. For example, as shown in fig. 5, a code compiling method according to at least one embodiment of the present disclosure includes the following steps S501 to S506.
Step S501, for each target thread local storage variable, determining a first function using the target thread local storage variable to construct a variable-function mapping relationship. The description of step S501 may refer to the description of step S104 in the above embodiment, and will not be repeated here.
Step S502, in response to the existence of a target thread local storage variable, allocating the at least one target thread local storage variable to the thread local memory based on the attribute of the at least one target thread local storage variable to obtain the thread local memory address corresponding to each target thread local storage variable. The description of step S502 may refer to the description of step S101 in the above embodiment, and will not be repeated here.
Step S503, obtaining a record of the thread local memory size required by all target thread local storage variables used by each first function in the code, so as to pre-allocate a fixed thread local memory space for each first function according to the record. The description of step S503 may refer to the description of step S106 in the above embodiment, and will not be repeated here.
Step S504, inserting an instruction for initializing the target thread local storage variable used by each first function into the intermediate representation corresponding to the code. The description of step S504 may refer to the description of step S105 in the above embodiment, and will not be repeated here.
Step S505, updating the intermediate representation corresponding to the code based on the thread local memory address corresponding to each target thread local storage variable. The description of step S505 may refer to the description of step S102 in the above embodiment, and will not be repeated here.
Step S506, generating machine code corresponding to the code based on the updated intermediate representation. The description of step S506 may refer to the description of step S103 in the above embodiment, and will not be repeated here.
It should also be noted that, in the various embodiments of the present disclosure, the execution order of the steps of the code compiling method is not limited, and although the execution of the steps is described in a specific order above, this does not constitute a limitation to the embodiments of the present disclosure. The individual steps in the code compilation method may be performed in series or in parallel, depending on the actual requirements.
For example, the code compiling method provided by at least one embodiment of the present disclosure may further include more or fewer steps than the above description, and the embodiment of the present disclosure is not limited thereto.
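Putting the pieces together, a hypothetical driver for steps S501 to S506 might look as follows; it merely wires up the toy sketches above, whereas a real implementation would run as passes over LLVM IR, and, as noted, the step order may vary.

#include <map>
#include <string>
#include <vector>
// (uses TLSVariable, sortTLSVariables, and allocateTLS from the earlier sketches)

void compileWithTLS(std::vector<TLSVariable> vars,
                    const std::map<std::string, std::vector<std::string>>& usersOf) {
    // S501: usersOf encodes the variable-to-first-function mapping (call graph).
    sortTLSVariables(vars);                   // S502: order by attribute priority
    auto layout = allocateTLS(vars, usersOf); // S502: per-variable starting addresses
    // S503: record each kernel's total TLS bytes as metadata (see the
    //       TLSSizeMetadata sketch) so frame lowering can pre-allocate.
    // S504: insert initialization instructions for each kernel's used
    //       variables (see usedTLSVars).
    // S505: rewrite llvm.threadlocal.address.p0 calls to the addresses in `layout`.
    // S506: generate machine code from the updated intermediate representation.
    (void)layout;
}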
Fig. 6 is a schematic block diagram of a code compiling apparatus according to at least one embodiment of the present disclosure. The code compiling means may be, for example, a compiler for compiling code comprising at least one thread local storage variable.
For example, as shown in fig. 6, a code compiling apparatus provided in at least one embodiment of the present disclosure includes an allocation module 601, an update module 602, and a generation module 603.
For example, the allocation module 601 is configured to allocate, in response to the existence of the target thread local storage variable, the at least one target thread local storage variable to the thread local memory based on an attribute of the at least one target thread local storage variable, to obtain a thread local memory address corresponding to each target thread local storage variable, where the target thread local storage variable is a thread local storage variable used by the first function in the at least one thread local storage variable. For the relevant content of the allocation module 601, reference may be made to the relevant description of step S101 in the above-mentioned embodiment of the code compiling method, which is not repeated here.
For example, the update module 602 is configured to update the intermediate representation corresponding to the code based on the thread local memory address corresponding to each target thread local storage variable. For the relevant content of the update module 602, reference may be made to the relevant description of step S102 in the above-mentioned embodiment of the code compiling method, which is not repeated here.
For example, the generation module 603 is configured to generate a machine code corresponding to the code based on the updated intermediate representation. For the relevant content of the generating module 603, reference may be made to the relevant description of step S103 in the foregoing embodiment of the code compiling method, which is not repeated here.
For example, in at least one embodiment of the present disclosure, the code compilation apparatus 600 may further include an initialization module configured to insert instructions in the intermediate representation corresponding to the code that initialize the target thread local storage variables used by each first function.
For example, in at least one embodiment of the present disclosure, the code compilation apparatus 600 may further include a building module configured to, for each target thread local storage variable, determine a first function using the target thread local storage variable to build a variable-function mapping relationship.
For example, in at least one embodiment of the present disclosure, the first function directly uses the target thread local storage variable, or uses the target thread local storage variable by invoking the second function.
For example, in at least one embodiment of the present disclosure, the first function is a function that is invoked by the host side and executed by the device side, and the second function is a function that is invoked by the device side and executed by the device side.
For example, in at least one embodiment of the present disclosure, the allocation module includes a determination unit and an allocation unit. The determining unit is configured to determine a variable allocation order based on an attribute of the at least one target thread local storage variable. The allocation unit is configured to allocate the at least one target thread local storage variable to the thread local memory in sequence according to a variable allocation order.
For example, in at least one embodiment of the present disclosure, the attribute of the target thread local storage variable includes at least one of a first function number corresponding to the target thread local storage variable, an alignment parameter of the target thread local storage variable, a size of the target thread local storage variable, and a defined order of the target thread local storage variable.
For example, in at least one embodiment of the present disclosure, the determining unit is further configured to rank the at least one target thread local storage variable in descending order of a first function number corresponding to the at least one target thread local storage variable, rank the at least one target thread local storage variable in descending order of an alignment parameter of the at least one target thread local storage variable, rank the at least one target thread local storage variable in ascending order of a size of the at least one target thread local storage variable, or rank the at least one target thread local storage variable in a defined order of the at least one target thread local storage variable.
For example, in at least one embodiment of the present disclosure, the determining unit is further configured to: rank the at least one target thread local storage variable in descending order of the number of first functions corresponding to the at least one target thread local storage variable; in response to a plurality of first thread local storage variables with the same number of corresponding first functions existing in the at least one target thread local storage variable, rank the plurality of first thread local storage variables in descending order of the alignment parameter of the target thread local storage variable; in response to a plurality of second thread local storage variables with the same alignment parameter existing in the plurality of first thread local storage variables, rank the plurality of second thread local storage variables in ascending order of the size of the target thread local storage variable; and in response to a plurality of third thread local storage variables with the same size existing in the plurality of second thread local storage variables, arrange the plurality of third thread local storage variables in the definition order of the target thread local storage variables.
For example, in at least one embodiment of the present disclosure, the allocation module may further include a start address determining unit configured to, for the target thread local storage variable corresponding to the plurality of first functions, use a maximum value of the allocated offsets corresponding to the plurality of first functions as an allocation start point of the target thread local storage variable in the thread local memory.
For example, in at least one embodiment of the present disclosure, the intermediate representation corresponding to the code includes instructions for obtaining addresses of target thread local storage variables, and the update module is further configured to, for each target thread local storage variable, replace the instructions for obtaining addresses of target thread local storage variables with thread local memory addresses corresponding to the target thread local storage variables.
For example, in at least one embodiment of the present disclosure, the code compiling apparatus 600 may further include a record obtaining module configured to obtain a record of a thread local memory size required for all target thread local storage variables used by each first function in the code, so as to pre-allocate a fixed thread local memory space for each first function according to the record.
For example, in at least one embodiment of the present disclosure, the lifecycle of the pre-allocated thread local memory space coincides with the execution cycle of the corresponding first function.
For example, in at least one embodiment of the present disclosure, thread local memory is located in a graphics processor.
It should be noted that the various modules and units described above may be implemented by software, hardware, firmware, or any combination thereof, for example, the allocation module, the update module, and the generation module may be implemented as an allocation circuit, an update circuit, and a generation circuit, respectively, and the embodiments of the present disclosure are not limited to their specific embodiments.
It should be understood that the code compiling apparatus 600 provided in at least one embodiment of the present disclosure may be used to implement the foregoing code compiling method, and may also achieve similar technical effects as the foregoing code compiling method, which is not described herein.
It should be noted that, in the embodiment of the present disclosure, the code compiling apparatus 600 may include more or less modules or units, and the connection relationship between the respective modules or units is not limited, and may be determined according to actual requirements. The specific configuration of each module or unit is not limited, and may be constituted by an analog device according to a circuit principle, a digital chip, or other applicable means.
Fig. 7 is a schematic block diagram of an electronic device provided in at least one embodiment of the present disclosure.
For example, as shown in fig. 7, an electronic device 700 includes at least one processor 701 and at least one memory 702. The at least one memory 702 includes one or more computer program modules. One or more computer program modules are stored in the memory 702 and configured to be executed by the at least one processor 701, the one or more computer program modules comprising instructions for performing the code compilation method described above, which when executed by the at least one processor 701, may perform one or more steps of the code compilation method provided by at least one embodiment of the present disclosure. The memory 702 and the processor 701 may be interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, the processor 701 may be a Central Processing Unit (CPU), Digital Signal Processor (DSP), Graphics Processor (GPU), General Purpose Graphics Processor (GPGPU), artificial intelligence (AI) accelerator, or another form of processing unit having data processing and/or program execution capabilities, such as a Field Programmable Gate Array (FPGA); for example, the Central Processing Unit (CPU) may be of the X86, ARM, or RISC-V architecture. The processor 701 may be a general-purpose processor or a special-purpose processor and may control other components in the electronic device 700 to perform the desired functions.
For example, memory 702 may comprise any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, random Access Memory (RAM) and/or Cache memory (Cache) and the like. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like.
Fig. 8 is a schematic block diagram of another electronic device provided in accordance with at least one embodiment of the present disclosure.
The electronic device in at least one embodiment of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal Digital Assistants (PDAs), tablet computers (PADs), portable Multimedia Players (PMPs), in-vehicle terminals (e.g., in-vehicle navigation terminals), wearable electronic devices, and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
The electronic device includes at least one processor and a memory. The processor herein may be referred to as a processing device 801 as described below, and the memory may include at least one of Read Only Memory (ROM), random Access Memory (RAM), and a storage device 808 as described below. The memory is used for storing a program for executing the method according to the above-mentioned method embodiments, and the processor is configured to execute the program stored in the memory. The processor may be a processing unit including a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device to perform the desired functions.
As shown in fig. 8, the electronic device 800 may include a processing means 801 (e.g., a central processing unit, a graphics processor, etc.) that may perform various appropriate actions and processes according to programs stored in a read-only memory (ROM) 802 or programs loaded from a storage means 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the electronic device 800. The processing means 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following means may be connected to the I/O interface 805: input means 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output means 807 including, for example, a display, speaker, vibrator, etc.; storage means 808 including, for example, magnetic tape, hard disk, etc.; and communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 shows an electronic device 800 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided; more or fewer means may be implemented or provided instead.
In particular, in accordance with at least one embodiment of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, at least one embodiment of the present disclosure includes a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the method of at least one embodiment of the present disclosure are performed when the computer program is executed by the processing device 801.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In at least one embodiment of the present disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In yet another embodiment of the present disclosure, a computer-readable signal medium may comprise a propagated data signal with computer-readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, radio Frequency (RF), and the like, or any suitable combination of the foregoing.
The computer readable medium may be included in the electronic device 800 or may exist alone without being incorporated into the electronic device 800.
Fig. 9 is a schematic block diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure.
For example, as shown in fig. 9, a non-transitory computer-readable storage medium 900 has stored thereon computer-readable instructions 901, which when executed by at least one processor perform one or more steps of the code compilation method described above.
For example, the storage medium may include a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, random Access Memory (RAM), read Only Memory (ROM), erasable Programmable Read Only Memory (EPROM), portable compact disc read only memory (CD-ROM), flash memory, or any combination of the foregoing, as well as other suitable storage media. For example, the readable storage medium may also be the memory 702 in fig. 7, and the related description may refer to the foregoing, which is not repeated herein.
While the disclosure has been described in detail with respect to the general description and the specific embodiments thereof, it will be apparent to those skilled in the art that certain modifications and improvements may be made thereto based on the embodiments of the disclosure. Accordingly, such modifications or improvements may be made without departing from the spirit of the disclosure and are intended to be within the scope of the disclosure as claimed.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) In the drawings for describing embodiments of the present disclosure, the thickness of layers or regions is exaggerated or reduced for clarity, i.e., the drawings are not drawn to actual scale.
(3) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely specific embodiments of the disclosure, but the scope of the disclosure is not limited thereto, and the scope of the disclosure should be determined by the claims.
