Detailed Description
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. In this regard, directional terminology, such as "top," "bottom," "front," "rear," "leading," "trailing," etc., is used with reference to the orientation of the figure being described. Because components of embodiments can be positioned in a number of different orientations, the directional terminology is used for purposes of illustration and is in no way limiting. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims. It is to be understood that features of the various exemplary embodiments described herein may be combined with each other, unless specifically noted otherwise.
FIG. 1 is a computer code diagram illustrating an embodiment of code 10 having an agile communication operator 12. When compiled and executed, agile communication operator 12 generates a segmented computation space based on a resource graph to distribute the computation space across compute nodes (e.g., compute node 121 shown in FIG. 4 and described in more detail below). The agile communication operator decomposes the computation space (represented by input indexable type 14 in the embodiment of FIG. 1) into segments 20 of agile indexable type 18 (also shown in the example of FIG. 3B), causes segments 20 to be assigned to compute nodes, and allows a user to centrally manage and automate movement of segments 20 among compute nodes. The segment movement may be managed using a full global-view representation or a local global-view representation of the segments, as described in more detail below.
Code 10 includes a sequence of instructions from a high level general purpose or data parallel programming language that may be compiled into one or more executables (e.g., DP executable 138 shown in FIG. 4) for execution by one or more DP optimal compute nodes (e.g., DP optimal compute node 121 shown in FIG. 4).
In one embodiment, code 10 includes a sequence of instructions from a high-level general purpose programming language with data parallel extensions (hereafter GP language) that form a program stored in a collection of one or more modules. The GP language may allow programs to be written in different parts (i.e., modules) so that each module may be stored in a separate file or location accessible by the computer system. The GP language provides a single language for programming a computing environment that includes one or more general purpose processors and one or more special purpose DP optimal compute nodes. The DP optimal compute node is typically a Graphics Processing Unit (GPU) or SIMD unit of a general purpose processor, but may also include a scalar or vector execution unit of a general purpose processor, a Field Programmable Gate Array (FPGA), or other suitable device in some computing environment. Using the GP language, a programmer may include both general purpose processor and DP source code in code 10 for execution by general purpose processors and DP compute nodes, respectively, and coordinate the execution of the general purpose processor and DP source code. In this embodiment, code 10 may represent any suitable type of code, such as an application, library function, or operating system service.
The GP language may be formed by extending a widely applicable high-level general purpose programming language, such as C or C++, to include data parallel features. Other examples of general purpose languages in which DP features may appear include: Java™, PHP, Visual Basic, Perl, Python™, C#, Ruby, Delphi, Fortran, VB, F#, OCaml, Haskell, Erlang, NESL, Chapel, and JavaScript™. The GP language implementation may include rich linking capabilities that allow different portions of a program to be included in different modules. The data parallel features provide programming tools that exploit the dedicated architecture of DP optimal compute nodes to allow data parallel operations to be performed faster or more efficiently than with a general purpose processor (i.e., a non-DP optimal compute node). The GP language may also be another suitable high level general purpose programming language that allows a programmer to program both general purpose processors and DP optimal compute nodes.
In another embodiment, code 10 includes a sequence of instructions from a high level data parallel programming language (hereinafter DP language) that form a program. The DP language provides a specialized language for programming DP optimal compute nodes in a computing environment having one or more DP optimal compute nodes. Using the DP language, a programmer generates DP source code in code 10 that is intended for execution on DP optimal compute nodes. The DP language provides programming tools that exploit the dedicated architecture of the DP optimal compute node to allow data parallel operations to be performed faster or more efficiently than with a general purpose processor. The DP language may be an existing DP programming language such as HLSL, GLSL, Cg, C++, NESL, Chapel, CUDA, OpenCL, Accelerator, Ct, PGI GPGPU Accelerator, CAPS GPGPU Accelerator, Brook+, CAL, APL, Fortran 90 (and higher), Data Parallel C, or DAPPLE. In this embodiment, code 10 may represent any suitable type of DP source code, such as an application, library function, or operating system service.
Code 10 includes portions of code designated for execution on a DP optimal compute node. In the embodiment of FIG. 1, in which code 10 is written using the GP language, the GP language allows a programmer to designate DP source code using an annotation 26 (e.g., __declspec(vector) ...) when defining a vector function. The annotation 26 is associated with a function name 27 (e.g., vector_func) of a vector function intended for execution on a DP optimal compute node. Code 10 may also include one or more invocations 28 (e.g., forall ..., vector_func, ...) of vector functions at call sites such as forall, reduce, scan, or sort. A vector function corresponding to a call site is referred to as a kernel function. A kernel function may call other vector functions in code 10 (i.e., other DP source code) and may be viewed as the root of a vector function call graph. Kernel functions may also use types (e.g., classes or structs) defined by code 10. These types may or may not be annotated as DP source code. In other embodiments, other suitable programming language constructs may be used to designate portions of code 10 as DP source code and/or general purpose processor code. Furthermore, in embodiments where code 10 is written using the DP language, annotations 26 may be omitted.
FIG. 2 is a block diagram illustrating an embodiment of applying agile communication operator 12 to an input indexable type 14 to generate an agile indexable type 18. As used herein, an indexable type is any data type that implements one or more subscript operators along with a rank, which is a non-negative integer, and a type denoted element_type. If index<N> is a type that represents an N-tuple of integers (i.e., any integral data type), then an instance of index<N> is a set of N integers {i0, i1, ..., im} where m is equal to N-1 (i.e., an N-tuple). A subscript operator of rank N takes an N-tuple instance of index<N> and associates that instance with another instance of the type called element_type, where element_type defines each element of the indexable type. In one embodiment, an indexable type defines one or more subscript operators.
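The operator declarations themselves are not reproduced in this excerpt; a plausible set of subscript operator forms, consistent with the element_type and index_declarator terms used here (a reconstruction offered for illustration, not a verbatim listing), is:
element_type operator[] (index_declarator);
const element_type operator[] (index_declarator) const;
element_type& operator[] (index_declarator);
const element_type& operator[] (index_declarator) const;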
where index_declarator takes the form of at least one of:
const index<rank>& idx;
const index<rank> idx;
index<rank>& idx;
index<rank> idx.
In other embodiments, the operators may be functions or a more general representation. The shape of an indexable type is the set of index<rank> for which one of the subscript operators above is defined. An indexable type typically has a polyhedral shape; that is, an indexable type may be represented algebraically as the intersection of a finite number of half-spaces formed by linear functions of the coordinate axes.
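As a concrete illustration, a minimal rank-2 indexable type might be sketched as follows (a sketch only: the flat std::vector storage and the field2d name are assumptions for illustration; only the index<N> tuple and the subscript operator follow the definitions above).
#include <array>
#include <vector>

template <int Rank>
struct index {
    std::array<int, Rank> v;                     // an N-tuple of integers
    int operator[](int i) const { return v[i]; }
};

// A rank-2 indexable type whose element_type is double.
class field2d {
public:
    field2d(int rows, int cols) : rows_(rows), cols_(cols), data_(rows * cols) {}

    // Subscript operator: takes an index<2> instance and associates it with
    // an element of type double (the element_type).
    double& operator[](index<2> idx) { return data_[idx[0] * cols_ + idx[1]]; }

private:
    int rows_, cols_;
    std::vector<double> data_;
};

int main() {
    field2d a(6, 6);
    a[index<2>{{1, 2}}] = 8.0;                   // element at row 1, column 2
}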
Referring to FIGS. 1 and 2, in one embodiment, the high-level language of code 10 provides agile communication operator 12 for use on input indexable type 14 in a data parallel computing environment. Input indexable type 14 has a rank (e.g., rank N in the embodiment of FIG. 1) and an element type (e.g., element type T in the embodiment of FIG. 1) and defines the computation space operated on by agile communication operator 12. Agile communication operator 12 receives input indexable type 14 and resource map 16 (e.g., resource_map in the example of FIG. 1). From input indexable type 14 and resource map 16, agile communication operator 12 generates an agile indexable type 18 with segments 20 (also referred to as sub-grids) specified by resource map 16 (also shown in the example of FIG. 3B). As shown in code 10, agile communication operator 12 is operable to pass agile indexable type 18 to a DP call site (i.e., forall in the example of FIG. 1). By doing so, agile communication operator 12 causes the vector function specified by the call site to be replicated on all compute nodes (e.g., compute node 121 shown in FIG. 4), where each compute node receives the segments 20 assigned to that compute node.
Agile communication operator 12 causes input indexable type 14 to be decomposed into segments 20 and causes each segment 20 to be assigned to a compute node as specified by resource map 16. Resource map 16 provides a specification of where the memory (i.e., input indexable type 14) is stored across one or more compute nodes. Resource map 16 specifies segments 20 such that the collection of segments 20 covers agile indexable type 18 without overlapping. Resource map 16 allows segments 20 to be specified with the same or different block sizes and/or regular or irregular block combinations.
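Schematically, the usage described above might look as follows. This is a sketch only, not a documented API: the field type, the resource_map construction helpers, and the agile and forall call syntax shown here are assumptions based on this description and will not compile outside the described environment.
// Input indexable type 14: a rank-2 field of doubles.
field<2, double> A(extent<2>(6, 6));

// Resource map 16: e.g., a block-ring decomposition over the available nodes.
resource_map rm = make_block_ring_map(A.get_extent(), get_compute_nodes());

// Agile communication operator 12: yields agile indexable type 18, whose
// segments 20 are distributed across the compute nodes named by rm.
auto agile_A = agile(A, rm);

// Passing the agile indexable type to a DP call site replicates the vector
// function on every compute node; each node operates on its assigned segments.
forall(agile_A.get_extent(), [=](index<2> idx) {
    agile_A[idx] += 1.0;
});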
FIGS. 3A-3C are block diagrams illustrating an example of generating and using agile indexable type 18(1). In the example of FIGS. 3A-3C, agile communication operator 12 partitions a 6x6 matrix (i.e., input indexable type 14(1)) with elements numbered 0 through 35 into nine segments 20 of agile indexable type 18(1) shown in FIG. 3B, as specified by a corresponding resource map 16 (shown in FIG. 2). Each segment 20 is represented by a different shading in FIG. 3B. For example, a first segment 20(1) includes elements 0, 1, 6, and 7, a second segment 20(2) includes elements 2, 3, 8, and 9, and so on. Agile communication operator 12 also causes segments 20(1)-20(9) to be assigned to a set of one or more compute nodes 121(1)-121(Q), where Q is an integer greater than or equal to one, as specified by the protocol in resource map 16 and as indicated by an arrow 30 in FIG. 3C.
Resource map 16 may incorporate any suitable assignment protocol, such as block decomposition, ring decomposition, block-block decomposition, or block-ring decomposition. The following protocol examples assume either three compute nodes 121(1)-121(3) (i.e., Q = 3) or four compute nodes 121(1)-121(4) (i.e., Q = 4), and assume that segments 20 are numbered 20(1)-20(9) across the rows from left to right starting with the first (i.e., top) row.
With row block decomposition and Q = 3, the 36 elements of input indexable type 14(1) are divided by three so that 12 elements are assigned to each compute node 121. Thus, resource map 16 causes elements 0 through 11 (i.e., segments 20(1)-20(3)) to be assigned to compute node 121(1), elements 12 through 23 (i.e., segments 20(4)-20(6)) to be assigned to compute node 121(2), and elements 24 through 35 (i.e., segments 20(7)-20(9)) to be assigned to compute node 121(3).
With row block decomposition and Q = 4, the 36 elements of input indexable type 14(1) are divided by four so that nine elements are assigned to each compute node 121. Accordingly, resource map 16 causes elements 0 through 8 to be assigned to compute node 121(1), elements 9 through 17 to be assigned to compute node 121(2), elements 18 through 26 to be assigned to compute node 121(3), and elements 27 through 35 to be assigned to compute node 121(4).
With column block decomposition and Q = 3, resource map 16 causes the segments 20 of the first and second columns (i.e., segments 20(1), 20(4), and 20(7)) to be assigned to compute node 121(1), the segments 20 of the third and fourth columns (i.e., segments 20(2), 20(5), and 20(8)) to be assigned to compute node 121(2), and the segments 20 of the fifth and sixth columns (i.e., segments 20(3), 20(6), and 20(9)) to be assigned to compute node 121(3).
With row ring decomposition and Q = 3, resource map 16 causes elements 3k (for k = 0 to 11) to be assigned to compute node 121(1), elements 3k+1 to be assigned to compute node 121(2), and elements 3k+2 to be assigned to compute node 121(3).
With row ring decomposition and Q = 4, resource map 16 causes elements 4k (for k = 0 to 8) to be assigned to compute node 121(1), elements 4k+1 to be assigned to compute node 121(2), elements 4k+2 to be assigned to compute node 121(3), and elements 4k+3 to be assigned to compute node 121(4).
With row block-ring decomposition and Q = 3, the decomposition is a ring decomposition over the segments 20(1)-20(9) shown in FIG. 3B. Accordingly, resource map 16 causes segments 20(1), 20(4), and 20(7) to be assigned to compute node 121(1), segments 20(2), 20(5), and 20(8) to be assigned to compute node 121(2), and segments 20(3), 20(6), and 20(9) to be assigned to compute node 121(3).
With row block-ring decomposition and Q = 4, resource map 16 causes segments 20(1), 20(5), and 20(9) to be assigned to compute node 121(1), segments 20(2) and 20(6) to be assigned to compute node 121(2), segments 20(3) and 20(7) to be assigned to compute node 121(3), and segments 20(4) and 20(8) to be assigned to compute node 121(4).
With row block-block decomposition and Q = 3, resource map 16 causes segments 20(1)-20(3) to be assigned to compute node 121(1), segments 20(4)-20(6) to be assigned to compute node 121(2), and segments 20(7)-20(9) to be assigned to compute node 121(3).
The row or column decomposition decision of resource map 16 may depend on the memory layout. For example, a column-major memory layout may imply a column decomposition using an appropriate protocol.
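For illustration, the block and ring assignments above reduce to simple index arithmetic. The following self-contained sketch (the function names and the ceiling-division block rule are assumptions; the nine segments and Q = 3 mirror the example of FIGS. 3A-3C) prints the owner of each segment under both protocols.
#include <cstdio>

// Block decomposition: contiguous runs of segments are assigned to each node.
int block_owner(int seg, int num_segments, int q) {
    int per_node = (num_segments + q - 1) / q;   // ceiling division
    return seg / per_node;
}

// Ring decomposition: segments are dealt out to the nodes round-robin.
int ring_owner(int seg, int q) {
    return seg % q;
}

int main() {
    const int num_segments = 9;   // segments 20(1)-20(9) of FIG. 3B
    const int q = 3;              // compute nodes 121(1)-121(3)
    for (int seg = 0; seg < num_segments; ++seg) {
        std::printf("segment 20(%d): block -> node 121(%d), ring -> node 121(%d)\n",
                    seg + 1,
                    block_owner(seg, num_segments, q) + 1,
                    ring_owner(seg, q) + 1);
    }
    return 0;
}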
In one embodiment, resource map 16 includes a collection of resource segments, where each resource segment associates a segment 20 with a resource view (i.e., an abstraction of a compute node) (not shown). For example, for an indexable type 14 defined by the following statement:
grid<rank> parent_grid;
where grid<rank> contains two data members:
extent<rank> _M_extent;
index<rank> _M_offset;
for example, the second segment 20(2) of FIG. 3B, viewed as a shape or grid, has _M_extent = {2, 2} and _M_offset = {0, 1}, and the sixth segment 20(6) has _M_extent = {2, 2} and _M_offset = {1, 2}. Accordingly, parent_grid can be decomposed using the following statements:
grid<rank> algorithmic_blocks[M1];
grid<rank> memory_blocks[M2];
grid<rank> compute_nodes[M3];
where M1 >= M2 >= M3 > 0. Typically, M3 divides M2 and M2 divides M1. The three decompositions algorithmic_blocks, memory_blocks, and compute_nodes all cover parent_grid without overlapping. algorithmic_blocks represents the decomposition used by the algorithm being implemented. memory_blocks represents the granularity of memory that is moved between nodes when necessary. compute_nodes represents the granularity at which data is assigned to compute nodes for storage.
Assume that there is an association such that each algorithmic_block or memory_block can find the compute_node on which it is stored, and such that each algorithmic_block can find the memory_block in which it is stored. A class called resource_map may be generated whose resource segments form the associations between the sub-grids and resource views (resource_view).
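A minimal sketch of these associations follows (the grid member names _M_extent, _M_offset, and _M_child follow the text; the contents of resource_view and the remaining member names are assumptions made for illustration).
#include <vector>

template <int Rank> struct extent { int v[Rank]; };
template <int Rank> struct index  { int v[Rank]; };

// A sub-grid: an extent plus an offset into parent_grid.
template <int Rank>
struct grid {
    extent<Rank> _M_extent;
    index<Rank>  _M_offset;
};

// Abstraction of a compute node 121.
struct resource_view {
    int node_id;
};

// A resource segment: associates a sub-grid (a segment 20) with the
// resource view (compute node) on which that segment is stored.
template <int Rank>
struct resource_segment {
    grid<Rank>    _M_child;
    resource_view _M_view;
};

// Resource map 16: the collection of resource segments covering parent_grid.
template <int Rank>
struct resource_map {
    std::vector<resource_segment<Rank>> _M_segments;
};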
By using agile communication operator 12, data of agile indexable type 18 can be accessed seamlessly, without the user needing to know the compute node on which the data currently resides. For an example indexable type 14, A, with shape parent_grid, the storage of A is determined by the instances of resource_segment. To access the element of A at:
index<rank> _Index;
first find the child grid containing _Index, and then determine the offset:
index<rank> _Offset;
so that:
_Index = child-grid offset + _Offset.
in the case of resource _ segment comments, the relationship is:
_Index = _Resource_segment._M_child._M_offset + _Offset.
To increase the speed of the lookup, a check is performed to determine whether _Index (as _Index changes) still belongs to _Resource_segment._M_child.
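The check itself does not appear at this point in the excerpt; from the discussion below it is presumably the local bounds test:
if (_Local_offset < _Local_bounds) {...}
where, presumably, _Local_offset = _Index - _Resource_segment._M_child._M_offset and _Local_bounds = _Resource_segment._M_child._M_extent, compared componentwise.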
The determination of the sub-grid or _Resource_segment to which a given _Index belongs depends on the decomposition pattern. In the worst case, a binary search in each dimension may be required and cannot be avoided. However, for a regular tile decomposition in which all tiles have an equal extent of, for example, 2048x2048, the _Resource_segment whose _M_child._M_offset corresponds to _Index is found from:
index<2> _Tile(2048, 2048);
(_Index + _Tile - 1) / _Tile.
That _Resource_segment (i.e., the current resource segment) can be used until the check:
if (_Local_offset < _Local_bounds) {...}
is violated, in which case the new _Index is again divided by _Tile and the lookup is repeated. This mechanism is optimal for algorithms with locality, which only need to find a new containing resource segment infrequently.
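A self-contained sketch of this locality-driven lookup for a regular rank-2 decomposition with 2048x2048 tiles follows (the names and the plain integer division used to locate the containing tile are assumptions; only the fast-path idea of reusing the current segment until the bounds check fails comes from the text).
#include <array>

using index2 = std::array<long long, 2>;

const index2 kTile = {2048, 2048};               // regular tile extent

// Tile coordinate of the segment containing a global index.
index2 containing_tile(const index2& idx) {
    return { idx[0] / kTile[0], idx[1] / kTile[1] };
}

struct segment_cursor {
    index2 tile = {0, 0};                        // current resource segment

    // Local offset of idx within the current segment.
    index2 local_offset(const index2& idx) const {
        return { idx[0] - tile[0] * kTile[0], idx[1] - tile[1] * kTile[1] };
    }

    // Bounds check: does idx still fall inside the current segment?
    bool contains(const index2& idx) const {
        index2 off = local_offset(idx);
        return off[0] >= 0 && off[0] < kTile[0] && off[1] >= 0 && off[1] < kTile[1];
    }

    // Fast path: keep the current segment; re-resolve only when the check fails.
    void resolve(const index2& idx) {
        if (!contains(idx)) tile = containing_tile(idx);
    }
};

int main() {
    segment_cursor cur;
    cur.resolve({100, 4000});                    // moves to tile {0, 1}
    return cur.contains({100, 4000}) ? 0 : 1;
}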
In the local global-view representation described below, the user index operator may omit this if-check (referred to herein as the bounds check):
if (_Local_offset < _Local_bounds) {...}
because the user is trusted to stay within bounds at every access. If a given resource segment is exhausted and another is to be used, the user is trusted to invoke a function that resets the current resource segment. In the simplest form of the local global-view representation, all three decompositions:
grid<rank> algorithmic_blocks[M1];
grid<rank> memory_blocks[M2];
grid<rank> compute_nodes[M3];
are regular, having blocks or tiles of the same size. A tile communication operator that partitions an indexable type into tiles can be applied to the first decomposition to produce:
algorithmic_tiles
where an individual tile is:
algorithmic_tiles(_tile_index).
When owner copying is initiated on the following statement:
algorithmic_tiles(_tile_index)
the containing memory_blocks[k1] and compute_nodes[k2] are determined. Next, the owner compute_nodes[k3] is determined, and then memory_blocks[k1] is moved from compute_nodes[k2] to compute_nodes[k3].
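A minimal sketch of this owner-copy step follows (the lookup tables, function names, and the move_block callback are assumptions made for illustration).
#include <functional>
#include <vector>

struct owner_copy_tables {
    std::vector<int> tile_to_memory_block;   // algorithmic tile -> memory_blocks[k1]
    std::vector<int> block_to_node;          // memory block -> compute_nodes[k] holding it now
    std::vector<int> tile_to_owner_node;     // algorithmic tile -> owner compute_nodes[k3]
};

// When algorithmic_tiles(_tile_index) is written under owner copying, the
// containing memory block is moved (if necessary) to the owner node.
void owner_copy(owner_copy_tables& t, int tile_index,
                const std::function<void(int block, int from, int to)>& move_block) {
    int k1 = t.tile_to_memory_block[tile_index];   // containing memory_blocks[k1]
    int k2 = t.block_to_node[k1];                  // currently stored on compute_nodes[k2]
    int k3 = t.tile_to_owner_node[tile_index];     // owner compute_nodes[k3]
    if (k2 != k3) {
        move_block(k1, k2, k3);                    // move memory_blocks[k1] from k2 to k3
        t.block_to_node[k1] = k3;                  // later accesses find it already moved
    }
}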
The automated memory movement granularity is typically finer than the sub-grid decomposition into segments 20. For example, assume that the matrix of FIG. 3A represents a 6144x6144 element matrix, i.e., each numbered algorithmic block represents 1024x1024 data elements. Assume that the 6144x6144 matrix is decomposed into 2048x2048 compute_nodes blocks, as in FIG. 3B. Further assume that Q = 4 and that compute nodes 121(1), 121(2), 121(3), and 121(4) are assigned the 2048x2048 blocks (i.e., segments 20(1)-20(9)) according to the block-ring decomposition. Then segments 20(1), 20(5), and 20(9) are assigned to compute node 121(1), segments 20(2) and 20(6) are assigned to compute node 121(2), segments 20(3) and 20(7) are assigned to compute node 121(3), and segments 20(4) and 20(8) are assigned to compute node 121(4). Memory may be moved in 1024x1024 blocks in this example. Accordingly, if the computation needs to move even a single data element of a 1024x1024 block, the entire 1024x1024 block is moved.
Agile communication operator 12 allows Data Parallel (DP) algorithms to be encoded with either a full global-view representation or a local global-view representation of segments 20 of agile indexable type 18 to manage movement of segments 20 between compute nodes.
The full global-view representation allows DP algorithms to be coded as if they were to run on a single compute node, while automated owner-copy memory movement is performed in the background. As an example of matrix addition, assume A, B, and C are each 6144x6144 matrices as shown in FIG. 3A, where each numbered block represents 1024x1024 data elements. A and B carry valid data, but C is allocated and does not necessarily carry any data. Further assume that A, B, and C are each distributed over compute nodes 121(1)-121(Q), where Q equals 4 in this case, and that segments 20(1)-20(9) of each of A, B, and C are stored across compute nodes 121(1)-121(Q) accordingly. The following computation is used:
C = A + B: where C(i, j) = A(i, j) + B(i, j): 0 <= i, j < 6
Each (i, j) represents 1024x1024 elements.
Owner copying means that data is moved, if necessary, to the compute node 121 at which the answer, i.e., C, is computed. In this example, blocks of A and B are moved to the compute node 121 where the corresponding block of C is stored, as dictated by the computation. However, for simple matrix addition, no movement is required, since the blocks of A and B and the corresponding blocks of C are stored on the same compute node 121. Consider the computation:
C(1, 2) = A(1, 2) + B(1, 2)
Block 8 in FIG. 3B is used for each of A, B, and C. Block 8 is part of segment 20(2), which is stored on compute node 121(2) for each of A, B, and C, so no data movement occurs. Similarly, the following computations are performed on the indicated segment 20 and compute node 121:
On segment 20(1), compute node 121(1): C(0, 0) = A(0, 0) + B(0, 0)
On segment 20(1), compute node 121(1): C(1, 0) = A(1, 0) + B(1, 0)
On segment 20(4), compute node 121(4): C(2, 0) = A(2, 0) + B(2, 0)
On segment 20(4), compute node 121(4): C(3, 0) = A(3, 0) + B(3, 0)
On segment 20(7), compute node 121(3): C(4, 0) = A(4, 0) + B(4, 0)
On segment 20(7), compute node 121(3): C(5, 0) = A(5, 0) + B(5, 0)
Here, a segment refers to an element of the decomposition:
grid<2> compute_nodes[9].
In fact:
grid<2> algorithmic_blocks[36];
grid<2> memory_blocks[18];
grid<2> compute_nodes[9];
where the algorithmic_blocks have an extent of 1024x1024, the memory_blocks have an extent of 2048x1024, and the compute_nodes have an extent of 2048x2048. Thus, matrix addition is a very typical example.
In another example using the assumptions above, the transpose of B is added to A to produce C as follows:
C = A + B^T: where C(i, j) = A(i, j) + B(j, i)^T: 0 <= i, j < 6
where each (i, j) represents 1024x1024 elements and B(j, i)^T is the transpose of the underlying 1024x1024 block.
In this case, B(j, i) is moved to the compute node 121 where C(i, j) (and A(i, j)) are stored, for all blocks except those in segments 20(1), 20(5), and 20(9). For example, the blocks of segment 20(1) need not be moved, since the computation of the blocks of segment 20(1) of C is:
C(0, 0) = A(0, 0) + B(0, 0)^T
C(0, 1) = A(0, 1) + B(1, 0)^T // B(1, 0)^T is in segment 20(1) of B
C(1, 0) = A(1, 0) + B(0, 1)^T // B(0, 1)^T is in segment 20(1) of B
C(1, 1) = A(1, 1) + B(1, 1)^T
However, for blocks of segment 20(4) of C:
C(2, 0) = A(2, 0) + B(0, 2)^T
C(2, 1) = A(2, 1) + B(1, 2)^T
C(3, 0) = A(3, 0) + B(0, 3)^T
C(3, 1) = A(3, 1) + B(1, 3)^T
the B blocks come from segment 20(2) of B, stored on compute node 121(2), and the C blocks belong to segment 20(4) of C, stored on compute node 121(4). Accordingly, block 2 of B (i.e., B(0, 2)^T) is moved to compute node 121(4), added to A(2, 0), and assigned to C(2, 0); block 8 of B (i.e., B(1, 2)^T) is moved to compute node 121(4), added to A(2, 1), and assigned to C(2, 1); block 3 of B (i.e., B(0, 3)^T) is moved to compute node 121(4), added to A(3, 0), and assigned to C(3, 0); and the 1024x1024 elements of block 9 of B (i.e., B(1, 3)^T) are moved to compute node 121(4), added to A(3, 1), and assigned to C(3, 1).
Using the full global-view representation, memory movement occurs automatically because each block carries information about which compute node 121 stores it. These computations may be directed from any one of compute nodes 121 or from a host (such as host 101 shown in FIG. 4 and described in more detail below).
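The movement pattern in this example can be checked with a short self-contained sketch (the segment numbering, block-ring assignment, and 6x6 block indexing follow FIGS. 3A-3C and the text; the function names are assumptions).
#include <cstdio>

// Segment 20(1)-20(9) (2048x2048) containing 1024x1024 block (i, j) of the
// 6x6 block matrix, numbered left to right, top to bottom as in FIG. 3B.
int segment_of(int i, int j) { return (i / 2) * 3 + (j / 2) + 1; }

// Block-ring assignment of segments 20(1)-20(9) to compute nodes 121(1)-121(4).
int node_of_segment(int seg) { return (seg - 1) % 4 + 1; }

int main() {
    // For C = A + B^T, block C(i, j) needs block B(j, i); list the moves.
    for (int i = 0; i < 6; ++i) {
        for (int j = 0; j < 6; ++j) {
            int owner  = node_of_segment(segment_of(i, j));   // holds C(i, j) and A(i, j)
            int source = node_of_segment(segment_of(j, i));   // holds B(j, i)
            if (owner != source)
                std::printf("B(%d,%d)^T: node 121(%d) -> node 121(%d) for C(%d,%d)\n",
                            j, i, source, owner, i, j);
        }
    }
    return 0;
}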
In other variations of the above example, multiple segments 20 may be assigned to the same compute node 121, where the number of compute nodes 121 is less than the number of segments 20. Furthermore, the processing power of compute node 121 may be weighted such that more segments 20 may be assigned to faster compute nodes 121 than to slower compute nodes 121. The assignment may be performed according to one or more of the protocols described above.
Automatic load balancing using work-stealing may also be implemented in the above variations. When a compute node 121 completes its computations, the compute node 121 attempts to steal computations assigned to other compute nodes 121. The compute node 121 directing the computation, or possibly the host, may store work-stealing queues of work items, where the queues contain tasks representing computations at memory movement granularity (e.g., 1024x1024) on the owner matrix (e.g., C).
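A minimal, single-threaded sketch of such a scheduler is shown below (a sketch only: a real work-stealing scheduler runs the nodes concurrently with synchronized deques; the task representation and victim-selection policy here are assumptions).
#include <deque>
#include <functional>
#include <vector>

struct node_queue {
    std::deque<std::function<void()>> tasks;   // work-stealing queue of work items
};

// Each node takes tasks from the top (front) of its own queue; when its queue
// is empty it steals one task from the bottom (back) of another node's queue.
void run_until_done(std::vector<node_queue>& nodes) {
    bool any_work = true;
    while (any_work) {
        any_work = false;
        for (std::size_t n = 0; n < nodes.size(); ++n) {
            if (!nodes[n].tasks.empty()) {
                auto task = nodes[n].tasks.front();   // own queue: take from the top
                nodes[n].tasks.pop_front();
                task();
                any_work = true;
            } else {
                for (auto& victim : nodes) {          // empty: steal from the bottom
                    if (!victim.tasks.empty()) {
                        auto task = victim.tasks.back();
                        victim.tasks.pop_back();
                        task();
                        any_work = true;
                        break;
                    }
                }
            }
        }
    }
}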
In the example above, in which the transpose of B is added to A to produce C, using four equally weighted compute nodes 121(1)-121(4) and the block-ring decomposition protocol, the following four work-stealing queues may be stored.
Thus, with the figures above, C = A + B^T, a memory movement granularity of 1024x1024, four equally weighted machines (w0 = w1 = w2 = w3 = 1), and block-ring decomposition:
queue 0 consists of 12 tasks: four 1024x1024 blocks for each of segments 20(1), 20(5), and 20(9);
queue 1 consists of 8 tasks: four 1024x1024 blocks for each of segments 20(2) and 20(6);
queue 2 consists of 8 tasks: four 1024x1024 blocks for each of segments 20(3) and 20(7);
queue 3 consists of 8 tasks: four 1024x1024 blocks for each of segments 20(4) and 20(8).
For example, queue 2 includes the following tasks:
C(0, 4) = A(0, 4) + B(4, 0)^T
C(0, 5) = A(0, 5) + B(5, 0)^T
C(1, 4) = A(1, 4) + B(4, 1)^T
C(1, 5) = A(1, 5) + B(5, 1)^T
C(4, 0) = A(4, 0) + B(0, 4)^T
C(4, 1) = A(4, 1) + B(1, 4)^T
C(5, 0) = A(5, 0) + B(0, 5)^T
C(5, 1) = A(5, 1) + B(1, 5)^T
each compute node 121 obtains a task from the top of its corresponding work-stealing queue until all tasks from the work-stealing queue are completed. When a work-stealing queue of a compute node 121 is empty, compute node 121 steals a task from the bottom of the work-stealing queue corresponding to another compute node 121. The local global view is typically enabled by the tile communication operator at the granularity level algorithmic _ blocks. Assuming a regular tile decomposition, the tile communication operator is applied to the first to produce:
algorithmic_tiles
where an individual tile is:
algorithmic_tiles(_tile_index).
When owner copying is initiated on the following statement:
algorithmic_tiles(_tile_index)
the containing memory_blocks[k1] and compute_nodes[k2] are determined. Next, the owner compute_nodes[k3] is determined, and then memory_blocks[k1] is moved from compute_nodes[k2] to compute_nodes[k3]. This is done at the algorithmic_tiles(_tile_index) access level. When implementing the algorithm, an element (or, recursively, a finer block) is accessed as:
algorithmic_tiles(_tile_index)(_local_index)
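The algorithmic_tiles(_tile_index)(_local_index) access pattern amounts to splitting a global element index into a tile index and a local index within that tile. For the regular 1024x1024 algorithmic decomposition of this example, a minimal sketch (names assumed) is:
#include <array>

using index2 = std::array<int, 2>;

const index2 kAlgTile = {1024, 1024};   // algorithmic_blocks extent in this example

// _tile_index: which algorithmic tile contains the global index.
index2 tile_index(const index2& g)  { return { g[0] / kAlgTile[0], g[1] / kAlgTile[1] }; }

// _local_index: position of the global index within that tile.
index2 local_index(const index2& g) { return { g[0] % kAlgTile[0], g[1] % kAlgTile[1] }; }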
Unlike the full global-view representation, the local global-view representation allows memory movement to be explicitly directed by the user. In the full global-view representation example described above, the memory movement granularity is a 1024x1024 block, so that if a compute node 121 accesses a single element of a block, the entire 1024x1024 block is moved to that compute node 121.
In some computations, the granularity of the computation is finer than the memory movement granularity, and the local global-view representation provides an advantage by letting the user explicitly direct where each memory block is to be moved. For example, assume that the memory movement granularity in the full global-view representation example is 2048x1024, i.e., two blocks are moved whenever an element is moved from either of them. Then, for C = A + B^T, the blocks of segment 20(4) of C are computed as:
C(2, 0) = A(2, 0) + B(0, 2)^T
C(2, 1) = A(2, 1) + B(1, 2)^T
C(3, 0) = A(3, 0) + B(0, 3)^T
C(3, 1) = A(3, 1) + B(1, 3)^T
In each case, the B blocks are stored on compute node 121(2), while the C and A blocks (C is the owner) are stored on compute node 121(4). Thus, the user explicitly directs block 2 of B (i.e., B(0, 2)^T) to be moved to compute node 121(4) in order to perform the first two of the above statements. Because of the 2048x1024 memory granularity, block 2 and block 8 of B (i.e., B(0, 2)^T and B(1, 2)^T) are both moved to compute node 121(4), allowing the additions in the first two statements to be performed by compute node 121(4). Likewise, the user explicitly directs block 3 of B (i.e., B(0, 3)^T) to be moved to compute node 121(4) in order to perform the last two of the above statements. Because of the 2048x1024 memory granularity, block 3 and block 9 of B (i.e., B(0, 3)^T and B(1, 3)^T) are both moved to compute node 121(4), allowing the additions in the latter two statements to be performed by compute node 121(4).
As these examples illustrate, the granularity of the computation may be finer than the memory movement granularity (which in turn may be finer than the compute node granularity), so that many tasks may be performed on a given memory movement block by one or more algorithms. A single indication to move a block allows the tasks that operate on that block to be performed more efficiently. Both the user and the implementation may omit the check of whether memory needs to be moved until the algorithm is directed to begin working on another memory movement block; and, as seen above, the tile communication operator corresponding to the algorithmic_blocks decomposition actually directs the memory movement when necessary. If a 1024x1024 tile (e.g., block 3) is to be moved, the containing 2048x1024 memory movement block (e.g., memory movement block 3) is moved from the containing 2048x2048 compute node block (e.g., segment 20(2)) to the 2048x2048 block determined by owner copying (e.g., segment 20(4)). If block 9 is now moved, accessing the corresponding tile looks up its containing memory movement block and determines that it has already been moved to segment 20(4), so no movement is required. The bounds check set forth above may be omitted because the correct memory movement has already been made at the tile level before the actual data element accesses within the tile. That is:
algorithmic_tiles(_tile_index).
generates any necessary memory moves, and then:
algorithmic_tiles(_tile_index)(_local_index)
can be accessed without performing a bounds check for each _local_index. For example, in the above computations, there are two algorithmic tasks for each memory movement block.
In practice, all computations performed on a given owner memory movement block (e.g., of C) may be grouped into one large task, and the first statement in the task may be a single memory move indication. In one embodiment, the indication may take the form of a standard C++ annotation on the task, as follows:
[[move_memory(C, B)]]
void kernel(field<2, double>& C, const field<2, double>& A, const field<2, double>& B);
By using this annotation, the compiler can optimize and interleave memory movement and computation.
The following code provides an overview of the implementation of agile communication operator 12 in one embodiment.
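The listing itself is not reproduced in this excerpt. A heavily simplified sketch of the shape such an implementation might take, reusing the resource_map and resource_segment sketches above, is shown below (the find and on_node helpers and all member names beyond those already introduced are assumptions).
// Agile indexable type 18: wraps a parent indexable type together with the
// resource_map describing how its segments 20 are distributed.
template <typename Parent_type>
class agile_type {
public:
    static const int rank = Parent_type::rank;
    typedef typename Parent_type::element_type element_type;

    agile_type(const Parent_type& parent, const resource_map<rank>& rm)
        : _M_parent(parent), _M_resource_map(rm) {}

    // Subscript operator: resolve the resource segment containing idx, then
    // delegate to the parent storage on that segment's compute node.
    element_type& operator[](const index<rank>& idx) {
        const resource_segment<rank>& seg = _M_resource_map.find(idx);
        return _M_parent.on_node(seg._M_view)[idx];
    }

private:
    Parent_type        _M_parent;
    resource_map<rank> _M_resource_map;
};

// Agile communication operator 12: takes input indexable type 14 and resource
// map 16 and produces agile indexable type 18.
template <typename Parent_type>
agile_type<Parent_type> agile(const Parent_type& parent,
                              const resource_map<Parent_type::rank>& rm) {
    return agile_type<Parent_type>(parent, rm);
}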
FIG. 4 is a block diagram illustrating an embodiment of a computer system 100 configured to compile and execute data parallel code 10 including agile communication operator 12.
Computer system 100 includes a host 101, host 101 having one or more Processing Elements (PEs) 102 housed in one or more processor packages (not shown), and a memory system 104. The computer system 100 also includes zero or more input/output devices 106, zero or more display devices 108, zero or more peripheral devices 110, and zero or more network devices 112. Computer system 100 also includes a compute engine 120 having one or more DP optimal compute nodes 121, where each DP optimal compute node 121 includes a set of one or more Processing Elements (PEs) 122 and a memory 124 that stores DP executable 138.
The host 101, input/output devices 106, display devices 108, peripheral devices 110, network devices 112, and compute engine 120 communicate using a set of interconnects 114 that include any suitable type, number, and configuration of controllers, buses, interfaces, and/or other wired or wireless connections.
Computer system 100 represents any suitable processing device configured for general or special purpose. Examples of computer system 100 include a server, a personal computer, a laptop computer, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a mobile phone, and an audio/video device. The components of computer system 100 (i.e., host 101, input/output device 106, display device 108, peripheral device 110, network device 112, interconnect 114, and compute engine 120) may be contained in a common housing (not shown) or in any suitable number of separate housings (not shown).
The processing elements 102 each form execution hardware configured to execute instructions (i.e., software) stored in the memory system 104. The processing elements 102 in each processor package may have the same or different architectures and/or instruction sets. For example, processing elements 102 may include any combination of sequential execution elements, superscalar execution elements, and data parallel execution elements (e.g., GPU execution elements). Each processing element 102 is configured to access and execute instructions stored in the memory system 104. These instructions may include a basic input output system (BIOS) or firmware (not shown), an operating system (OS) 132, code 10, compiler 134, GP executable 136, and DP executable 138. Each processing element 102 may execute instructions in conjunction with or in response to information received from input/output devices 106, display devices 108, peripheral devices 110, network devices 112, and/or compute engine 120.
Host 101 boots and executes OS 132. OS 132 includes instructions executable by the processing elements to manage the components of computer system 100 and provide a set of functions that allow programs to access and use the components. In one embodiment, OS 132 is a Windows operating system. In other embodiments, OS 132 is another operating system suitable for use with computer system 100.
When a computer system executes compiler 134 to compile code 10, compiler 134 generates one or more executables — e.g., one or more GP executables 136 and one or more DP executables 138. In other embodiments, compiler 134 may generate one or more GP executables 136 such that each GP executable 136 includes one or more DP executables 138, or may generate one or more DP executables 138 without generating any GP executables 136. GP executable 136 and/or DP executable 138 are generated in response to a call to compiler 134 with data parallel extensions to compile all or selected portions of code 10. The call may be generated by, for example, a programmer or other user of computer system 100 or other code in another computer system (not shown).
GP executable 136 represents a program intended for execution on one or more general purpose processing elements 102, such as a Central Processing Unit (CPU). GP executable 136 includes low level instructions from the instruction set of one or more general purpose processing elements 102.
DP executable 138 represents a data parallel program or algorithm (e.g., a shader) that is intended and optimized for execution on one or more Data Parallel (DP) optimal compute nodes 121. In one embodiment, DP executable 138 comprises DP byte code or some other intermediate representation (IL) that is converted into low-level instructions from an instruction set of DP optimal compute node 121 using a device driver (not shown) before being executed on DP optimal compute node 121. In other embodiments, DP executable 138 comprises low-level instructions from one or more instruction sets of DP optimal compute nodes 121, where the low-level instructions are inserted by compiler 134. Thus, GP executable 136 may be directly executed by one or more general purpose processors (such as CPUs), and DP executable 138 may either be directly executed by one or more DP optimal compute nodes 121 or may be executed by one or more DP optimal compute nodes 121 after being converted to low-level instructions for DP optimal compute nodes 121.
Computer system 100 may execute GP executable 136 using one or more processing elements 102, and computer system 100 may execute DP executable 138 using one or more PEs 122 described in more detail below.
The memory system 104 includes any suitable type, number, and configuration of volatile or non-volatile storage devices configured to store instructions and data. The storage devices of memory system 104 represent computer-readable storage media that store computer-executable instructions (i.e., software) including OS 132, code 10, compiler 134, GP executable 136, and DP executable 138. The instructions are executable by computer system 100 to perform the functions and methods of OS 132, code 10, compiler 134, GP executable 136, and DP executable 138 described herein. Memory system 104 stores instructions and data received from processing elements 102, input/output devices 106, display devices 108, peripheral devices 110, network devices 112, and compute engine 120. Memory system 104 provides stored instructions and data to processing elements 102, input/output devices 106, display devices 108, peripheral devices 110, network devices 112, and compute engine 120. Examples of storage devices in memory system 104 include hard disk drives, Random Access Memory (RAM), Read Only Memory (ROM), flash drives and cards, and magnetic and optical disks such as CDs and DVDs.
Input/output devices 106 include any suitable type, number, and configuration of input/output devices configured to input instructions or data from a user to computer system 100 and output instructions or data from computer system 100 to a user. Examples of input/output devices 106 include keyboards, mice, touch pads, touch screens, buttons, dials, knobs, and switches.
Display devices 108 include any suitable type, number, and configuration of display devices configured to output textual and/or graphical information to a user of computer system 100. Examples of display device 108 include a monitor, a display screen, and a projector.
Peripherals 110 include any suitable type, number, and configuration of peripherals configured to operate with one or more other components in computer system 100 to perform general-purpose or special-purpose processing functions.
Network devices 112 include any suitable type, number, and configuration of network devices configured to allow computer system 100 to communicate across one or more networks (not shown). Network device 112 may operate according to any suitable network protocol and/or configuration to allow computer system 100 to send information to, and receive information from, a network.
Compute engine 120 is configured to execute DP executable 138. Compute engine 120 includes one or more compute nodes 121. Each compute node 121 is a collection of computing resources that share a memory hierarchy. Each compute node 121 includes a set of one or more PEs 122 and a memory 124 that stores DP executable 138. PEs 122 execute DP executable 138 and store results generated by DP executable 138 in memory 124. In particular, PE 122 executes DP executable 138 to apply agile communication operator 12 to input indexable type 14 to generate output indexable type 18 as shown in FIG. 4 and described in detail above.
A compute node 121 having one or more compute resources with a hardware architecture optimized for data parallel computing (i.e., executing DP programs or algorithms) is referred to as a DP optimal compute node 121. Examples of DP optimal compute node 121 include node 121 where the set of PEs 122 includes one or more GPUs, and node 121 where the set of PEs 122 includes the set of SIMD units in a general purpose processor package. A compute node 121 that does not have any computational resources with a hardware architecture optimized for data parallel computing (e.g., a processor package with only general purpose processing elements 102) is referred to as a non-DP optimal compute node 121. In each compute node 121, memory 124 may be separate from memory system 104 (such as GPU memory used by a GPU) or may be part of memory system 104 (e.g., memory used by SIMD units in a general purpose processor package).
Host 101 forms a host compute node configured to provide DP executable 138 to compute node 121 for execution and to receive results generated by DP executable 138 using interconnect 114. The host compute node includes a collection of general purpose computing resources (i.e., general purpose processing elements 102) that share a memory hierarchy (i.e., memory system 104). The host compute node may be configured with a symmetric multiprocessing architecture (SMP) and may also be configured to maximize memory locality of the memory system 104 using, for example, a non-uniform memory access (NUMA) architecture.
OS 132 of the host compute node is configured to execute the DP call site to cause DP executable 138 to be executed by the DP optimal compute node or the non-DP optimal compute node 121. In embodiments where memory 124 is separate from memory system 104, the host compute node causes DP executable 138 and the one or more indexable types 14 to be copied from memory system 104 to memory 124. In embodiments where memory system 104 includes memory 124, the host compute node may designate a copy of DP executable 138 and/or one or more indexable types 14 in memory system 104 as memory 124 and/or copy DP executable 138 and/or one or more indexable types 14 from one portion of memory system 104 into another portion of memory system 104 that forms memory 124. The replication process between compute node 121 and the host compute node may be a synchronization point unless it is designated as asynchronous.
The host compute node and each compute node 121 may execute code concurrently independently of each other. The host compute node and each compute node 121 may interact at a synchronization point to coordinate node computations.
In one embodiment, compute engine 120 represents a graphics card in which one or more Graphics Processing Units (GPUs) include PEs 122 and memory 124 separate from memory system 104. In this embodiment, a driver (not shown) of the graphics card may convert byte code or some other intermediate representation (IL) of DP executable 138 into an instruction set of the GPU for execution by PEs 122 of the GPU.
In another embodiment, compute engine 120 is formed from a combination of one or more GPUs (i.e., PEs 122) included in a processor package having one or more general purpose processing elements 102 and a portion of memory system 104 including memory 124. In this embodiment, additional software may be provided on computer system 100 to convert byte code or some other intermediate representation (IL) of DP executable 138 into the instruction set of the GPU in the processor package.
In another embodiment, compute engine 120 is formed from a combination of one or more SIMD units in one or more processor packages that include processing elements 102 and a portion of memory system 104 that includes memory 124. In this embodiment, additional software may be provided on computer system 100 to convert byte code or some other intermediate representation (IL) of DP executable 138 into the instruction set of the SIMD units in the processor package.
In yet another embodiment, compute engine 120 is formed from a combination of one or more scalar or vector processing pipelines in one or more processor packages that include processing elements 102 and a portion of memory system 104 that includes memory 124. In this embodiment, additional software may be provided on computer system 100 to convert byte code or some other intermediate representation (IL) of DP executable 138 into the instruction set of the scalar processing pipeline in the processor package.
Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific embodiments shown and described without departing from the scope of the present invention. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. Therefore, it is intended that this invention be limited only by the claims and the equivalents thereof.