cafbench: Simple benchmark for Fortran coarray operations
!----------------------------------------------------------------------------!
!
! Fortran Coarray Micro-Benchmark Suite - Version 1.0
!
! David Henty, EPCC; d.henty@epcc.ed.ac.uk
!
! Copyright 2013 the University of Edinburgh
!
! Licensed under the Apache License, Version 2.0 (the "License");
! you may not use this file except in compliance with the License.
! You may obtain a copy of the License at
!
!     http://www.apache.org/licenses/LICENSE-2.0
!
! Unless required by applicable law or agreed to in writing, software
! distributed under the License is distributed on an "AS IS" BASIS,
! WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
! See the License for the specific language governing permissions and
! limitations under the License.
!
!----------------------------------------------------------------------------!

This software is released under the license in "LICENSE.txt".
D. Henty, "Performance of Fortran Coarrays on the Cray XE6", inProceedings of Cray User Group 2012.https://cug.org/proceedings/attendee_program_cug2012/includes/files/pap181.pdf
D. Henty, "A Parallel Benchmark Suite for Fortran Coarrays",Applications, Tools and Techniques on the Road t* Exascale Computing,(IOS Press, 2012), pp. 281-288.
This set of benchmarks aims to measure the performance of various parallel operations involving Fortran coarrays. These include point-to-point ("ping-pong") data transfer patterns, synchronisation patterns and halo-swapping for 3D arrays.
Unpack the tar file.
Select the required benchmarks by editing "cafparams.f90".
Compile using "make".
The supplied Makefile is configured for the Cray compiler - you will have to set "FC", "FFLAGS", "LDFLAGS" and "LIBS" appropriately for a different compiler.
Note that the benchmark can use MPI if required. Currently this is used in the synchronisation section (MPI_Barrier), the pt2pt section (MPI_Send) and to find out the name of the node on which each image is running (MPI_Get_processor_name).
If you do not want to use MPI then specify -UMPI in the Makefile.
The executable "cafbench" runs stand-alone without any flags or inputfiles. You will have to launch it as appropriate on your parallelsystem, eg on a Cray: "aprun -n ./cafbench".
The benchmark has three separate sections:
Point-to-point reports the latency and bandwidth (including any synchronisation overheads).

Synchronisation reports the overhead by performing calculations with and without synchronisation and subtracting the two times.

Halo reports the time and bandwidth for regular halo swapping in a 3D pattern.
In all cases the basic data types are double precision numbers.
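As a rough illustration of what the halo benchmark does, here is a minimal sketch of a coarray halo swap along one dimension only. The variable names, the 1D (slab) decomposition and the requirement of at least three images are assumptions of this sketch; the real benchmark uses a full 3D decomposition and several code variants.

  program haloswap_sketch
    implicit none
    integer, parameter :: nx = 16, ny = 16, nz = 16
    double precision   :: x(nx, ny, 0:nz+1)[*]
    integer :: me, nimg, imup, imdown

    me   = this_image()
    nimg = num_images()

    ! periodic up/down neighbours in a simple 1D decomposition
    imup   = me + 1; if (imup   > nimg) imup   = 1
    imdown = me - 1; if (imdown < 1   ) imdown = nimg

    if (nimg < 3) error stop 'sketch assumes at least three images'

    x = dble(me)

    sync images ( [imdown, imup] )      ! make sure the neighbours are ready

    ! put my boundary planes into the neighbours' halo planes
    x(:, :, 0   )[imup]   = x(:, :, nz)
    x(:, :, nz+1)[imdown] = x(:, :, 1)

    sync images ( [imdown, imup] )      ! wait until the halos have arrived

    if (me == 1) print *, 'halos now hold', x(1,1,0), 'and', x(1,1,nz+1)
  end program haloswap_sketch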
Which of these are run is controlled by the values of p2pbench, syncbench and halobench in cafparams.f90.
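As a purely hypothetical sketch (the real cafparams.f90 in the distribution may be laid out quite differently and contains further parameters), those switches could look something like this:

  ! Hypothetical sketch only - not a copy of the real cafparams.f90
  module cafparams

    implicit none

    ! set any of these to .false. to skip that section of the benchmark
    logical, parameter :: p2pbench  = .true.   ! point-to-point ping-pongs
    logical, parameter :: syncbench = .true.   ! synchronisation overheads
    logical, parameter :: halobench = .true.   ! 3D halo swapping

  end module cafparams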
Currently there is no easy way to select which of the point-to-point benchmarks are executed: either they are all executed or none at all. However, it is straightforward to edit cafpt2ptdriver.f90 to run only selected cases if required.
The point-to-point benchmarks use both remote read and remote write.
All data patterns are characterised by three parameters: count, blksize and stride. The data transferred is "count" separate blocks each of size "blksize", separated by "stride". We also print out "ndata" (the amount of data actually sent, ie count*blksize) and "nextent", which is the distance between the first and last data items (and which is larger than "ndata" for strided patterns). All data arrays contain double precision numbers.
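To make the relationship between these quantities concrete, here is a small stand-alone illustration. The numbers are arbitrary, and the nextent formula assumes that "stride" measures the separation between the starts of consecutive blocks, as in the strided patterns described later.

  program pattern_demo
    implicit none
    integer :: count, blksize, stride, ndata, nextent

    count   = 4      ! number of blocks
    blksize = 2      ! elements per block
    stride  = 8      ! separation between the starts of successive blocks

    ndata   = count*blksize                ! data actually transferred
    nextent = (count-1)*stride + blksize   ! first to last element spanned

    print *, 'ndata   =', ndata            ! prints 8
    print *, 'nextent =', nextent          ! prints 26
  end program pattern_demo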
The same pattern may often be realised in several different ways (eg inline or via a subroutine) to test the robustness of the compiler. This might seem unnecessary, but in the early compiler releases seemingly similar expressions have given very different performance, eg x(1:ndata) was much slower than x(:).
The word "many" indicates more than one remote operation or more thanone call to a subroutine (subroutines are indicated by "sub"), althougha good compiler may merge these in a single operation.
Synchronisation is included in the timings. Many repetitions are done to get sensible results and the sync is also done many times. Synchronisation can be global (sync all) or point-to-point (sync images).
Pingpongs are done in three ways:
Single ping-pong between images 1 and numimages. All other images are idle (except that they must call sync all for global synchronisation).

Multiple ping-pong where all images are active. Image i is paired with image i+numimages/2; note this only takes place for even numbers of images greater than 2. This can give significantly different bandwidths depending on the choice of synchronisation (see crossing below).
Multiple crossing ping-pong which is as above but every other pair swaps in the opposite direction, ie if image 1 is sending to 1+numimages/2, then image 2 is receiving from 2+numimages/2. This ensures that we exploit the bidirectional bandwidth. Note that in practice this is the same as multiple if you use point-to-point synchronisation: in that case the pairs of images naturally get out of sync as they contend for bandwidth. However, for global synchronisation this realises a different pattern from multiple. This test is really there to explain any differences in multiple for different choices of synchronisation.
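For orientation, here is a minimal, untimed sketch of the single ping-pong case using puts and point-to-point synchronisation. The names and structure are illustrative only; the benchmark's own driver (cafpt2ptdriver.f90) is considerably more general.

  program pingpong_sketch
    implicit none
    integer, parameter :: ndata = 1024, nrep = 100
    double precision   :: x(ndata)[*]
    integer :: irep, me, partner

    me      = this_image()
    partner = num_images()

    if (num_images() < 2) error stop 'run with at least two images'

    x = dble(me)

    do irep = 1, nrep
       if (me == 1) then
          x(1:ndata)[partner] = x(1:ndata)   ! ping: put data to the far image
          sync images (partner)              ! signal that the data is there
          sync images (partner)              ! wait for the reply to arrive
       else if (me == partner) then
          sync images (1)                    ! wait for the incoming put
          x(1:ndata)[1] = x(1:ndata)         ! pong: send it straight back
          sync images (1)                    ! signal that the reply is there
       end if
    end do

    if (me == 1) print *, 'ping-pong sketch complete; x(1) =', x(1)
  end program pingpong_sketch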
The patterns are as follows - all except MPI Send are replicated for get (remote read):
put: Contiguous put done inline: x(1:ndata)[image2] = x(1:ndata)
subput: Contiguous put done via a subroutine with target = source = x and disp = 1, count = ndata, ie: target(disp:disp+count-1)[image] = source(disp:disp+count-1)
simple subput: As above but with simpler arguments to the subroutine: target(:)[image] = source(:)
all put: Arrays are allocated to be of size ndata and a simple put is done inline. This is like simple subput above except there the arrays are implicitly resized via a subroutine call. Code is: x(:)[image2] = x(:)
many put: A contiguous put done as many separate puts of different blksize:

  do i = 1, count
     x(1+(i-1)*blksize:i*blksize)[image2] = &
          x(1+(i-1)*blksize:i*blksize)
sub manyput: Exactly as many put but done in a separate subroutine:

  do i = 1, count
     target(disp+(i-1)*blksize:disp+i*blksize-1)[image] = &
          source(disp+(i-1)*blksize:disp+i*blksize-1)
many subput: Same pattern but with many separate invocations of subput (a sketch of such a routine is given after this list):

  do i = 1, count
     call cafput(x, x, 1+(i-1)*blksize, blksize, image1)
strided put: Strided data done inline in the code: x(1:nextent:stride)[image2] = x(1:nextent:stride)
strided subput: As above but done via a subroutine: target(istart:istop:stride)[image] = source(istart:istop:stride)
strided many put: The most complex pattern: strided but with blocks larger than a single unit. Pattern is a block of data, followed by a gap of the same size, repeated:

  do i = 1, count
     x(1+2*(i-1)*blksize:(2*i-1)*blksize)[image2] = &
          x(1+2*(i-1)*blksize:(2*i-1)*blksize)

This pattern is a useful measurement in cases where the compiler vectorises many put into a single put of size ndata.
MPI Send: A regular MPI ping-pong with no coarray synchronisation, done as a sanity check for the coarray performance numbers.
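For reference, the subput-style patterns above go through a small wrapper routine. A hedged sketch of what such a routine might look like is given here; the argument order is inferred from the many subput call shown earlier and may well differ from the real cafput in the benchmark source.

  ! Sketch only: argument order and declarations are assumptions
  subroutine cafput(target, source, disp, count, image)
    implicit none
    integer,          intent(in)    :: disp, count, image
    double precision, intent(inout) :: target(*)[*]   ! coarray destination
    double precision, intent(in)    :: source(*)      ! local source buffer

    ! remote write of "count" contiguous elements starting at "disp"
    target(disp:disp+count-1)[image] = source(disp:disp+count-1)

  end subroutine cafput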
The different synchronisation types are:
sync all: a simple call to sync all.

sync mpi barrier: MPI call for comparison with sync all above.

sync images pairwise: each image calls sync images with a single neighbour; images are paired up in the same pattern as for Multiple ping-pong above.
sync images random: each image calls sync images with N neighbours chosen randomly (to ensure that they all match up, I actually set up a simple ring pattern and then randomly permute it). N is chosen as 2, 4, 6, ..., syncmaxneigh, capped if this starts to exceed the total number of images. The default syncmaxneigh is 12.
sync images ring: each image calls sync images with N neighbours paired as image +/- 1, image +/- 2, ..., image +/- syncmaxneigh/2, with periodic boundary conditions.

sync images 3d grid: each image calls sync images with 6 neighbours, which are chosen as the up and down neighbours in all directions in a 3D cartesian grid (with periodic boundaries). The dimensions of the 3D grid are selected via a call to MPI_Cart_dims (suitably reversed for Fortran indexing). This is precisely the synchronisation pattern used in the subsequent halo benchmark.
sync lock: all images lock a variable on image 1.
sync critical: all images execute a critical region.
Note that in all of these the time for some computation (a simple delay loop) is compared to the time for the computation plus synchronisation, and these are subtracted to get the synchronisation overhead.
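A minimal sketch of this subtraction approach, for the sync all case only, is shown below. The delay routine, repetition counts and timer are placeholders in this sketch rather than the benchmark's own code.

  program syncoverhead_sketch
    implicit none
    integer, parameter :: nrep = 1000, ndelay = 100000
    integer :: irep
    double precision :: t0, t1, t2, tcomp, tboth, work

    work = 0.0d0

    sync all                             ! line the images up before timing

    t0 = walltime()
    do irep = 1, nrep
       call delay(ndelay, work)          ! computation only
    end do
    t1 = walltime()

    do irep = 1, nrep
       call delay(ndelay, work)          ! computation plus synchronisation
       sync all
    end do
    t2 = walltime()

    tcomp = t1 - t0
    tboth = t2 - t1

    if (this_image() == 1) then
       print *, 'estimated sync all overhead per call:', &
                (tboth - tcomp)/dble(nrep)
       print *, 'delay-loop checksum (stops dead-code removal):', work
    end if

  contains

    subroutine delay(n, x)
      ! trivial work loop standing in for the benchmark's delay routine
      integer,          intent(in)    :: n
      double precision, intent(inout) :: x
      integer :: i
      do i = 1, n
         x = x + 1.0d-9
      end do
    end subroutine delay

    double precision function walltime()
      ! simple wall-clock timer based on system_clock
      integer :: tick, rate
      call system_clock(tick, rate)
      walltime = dble(tick)/dble(rate)
    end function walltime

  end program syncoverhead_sketch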