Movatterモバイル変換

Parallel external memory

From Wikipedia, the free encyclopedia

In computer science, aparallel external memory (PEM) model is acache-aware, external-memoryabstract machine.^[1] It is the parallel-computing analogy to the single-processorexternal memory (EM) model. In a similar way, it is the cache-aware analogy to theparallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

Model

[edit]

Definition

[edit]

The PEM model^[1] is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of $P {\displaystyle P}$ processors and a two-levelmemory hierarchy. This memory hierarchy consists of a large external memory (main memory) of size $N {\displaystyle N}$ and $P {\displaystyle P}$ small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size $M {\displaystyle M}$ which is partitioned in blocks of size $B {\displaystyle B}$ . The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size $B {\displaystyle B}$ .

I/O complexity

[edit]

The complexity measure of the PEM model is the I/O complexity,^[1] which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if $P {\displaystyle P}$ processors load parallelly a data block of size $B {\displaystyle B}$ form the main memory into their caches, it is considered as an I/O complexity of $O(1)$ not $O(P)$ . A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.

Read/write conflicts

[edit]

In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts^[1] occur. Like in the PRAM model, three different variations of this problem are considered:

Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

The following two algorithms^[1] solve the CREW and EREW problem if $P\leq B$ processors write to the same block simultaneously.A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of $P {\displaystyle P}$ parallel block transfers. A second approach needs $O(\log(P))$ parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round $P {\displaystyle P}$ processors combine their blocks into $P/2$ blocks. Then $P/2$ processors combine the $P/2$ blocks into $P/4$ . This procedure is continued until all the data is combined in one block.

Comparison to other models

[edit]


Model	Multi-core	Cache-aware
Random-access machine (RAM)	No	No
Parallel random-access machine (PRAM)	Yes	No
External memory (EM)	No	Yes
Parallel external memory (PEM)	Yes	Yes

Examples

[edit]

Multiway partitioning

[edit]

Let $M=\{m_{1},...,m_{d-1}\}$ be a vector of d-1 pivots sorted in increasing order. LetA be an unordered set of N elements. A d-way partition^[1] ofA is a set $\Pi =\{A_{1},...,A_{d}\}$ , where $\cup _{i=1}^{d}A_{i}=A$ and $A_{i}\cap A_{j}=\emptyset$ for $1\leq i<j\leq d$ . $A_{i}$ is called the i-th bucket. The number of elements in $A_{i}$ is greater than $m_{i-1}$ and smaller than $m_{i}^{2}$ . In the following algorithm^[1] the input is partitioned into N/P-sized contiguous segments $S_{1},...,S_{P}$ in main memory. The processor i primarily works on the segment $S_{i}$ . The multiway partitioning algorithm (PEM_DIST_SORT^[1]) uses a PEMprefix sum algorithm^[1] to calculate the prefix sum with the optimal $O\left({\frac {N}{PB}}+\log P\right)$ I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.

// Compute parallelly a d-way partition on the data segments $S_{i}$ for each processor iin parallel do    Read the vector of pivotsM into the cache.    Partition $S_{i}$  into d buckets and let vector $M_{i}=\{j_{1}^{i},...,j_{d}^{i}\}$  be the number of items in each bucket.end forRun PEM prefix sum on the set of vectors $\{M_{1},...,M_{P}\}$  simultaneously.// Use the prefix sum vector to compute the final partitionfor each processor iin parallel do    Write elements $S_{i}$  into memory locations offset appropriately by $M_{i-1}$  and $M_{i}$ .end forUsing the prefix sums stored in $M_{P}$  the last processor P calculates the vectorB of bucket sizes and returns it.

If the vector of $d=O\left({\frac {M}{B}}\right)$ pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with $O\left({\frac {N}{PB}}+\left\lceil {\frac {d}{B}}\right\rceil >\log(P)+d\log(B)\right)$ I/O complexity. The content of the final buckets have to be located in contiguous memory.

Selection

[edit]

Theselection problem is about finding the k-th smallest item in an unordered listA of sizeN.The following code^[1] makes use ofPRAMSORT which is a PRAM optimal sorting algorithm which runs in $O(\log N)$ , andSELECT, which is a cache optimal single-processor selection algorithm.

if $N\leq P$ then ${\texttt {PRAMSORT}}(A,P)$ return $A[k]$ end if //Find median of each $S_{i}$ for each processoriin parallel do $m_{i}={\texttt {SELECT}}(S_{i},{\frac {N}{2P}})$ end for // Sort medians ${\texttt {PRAMSORT}}(\lbrace m_{1},\dots ,m_{2}\rbrace ,P)$ // Partition around median of medians $t={\texttt {PEMPARTITION}}(A,m_{P/2},P)$ if $k\leq t$ thenreturn ${\texttt {PEMSELECT}}(A[1:t],P,k)$ elsereturn ${\texttt {PEMSELECT}}(A[t+1:N],P,k-t)$ end if

Under the assumption that the input is stored in contiguous memory,PEMSELECT has an I/O complexity of:

O\left({\frac {N}{PB}}+\log(PB)\cdot \log({\frac {N}{P}})\right)

Distribution sort

[edit]

Distribution sort partitions an input listA of sizeN intod disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

If $P=1$ the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm^[1] is used:

// Sample ${\tfrac {4N}{\sqrt {d}}}$  elements fromAforeach processoriin parallel doif $M<|S_{i}|$ then $d=M/B$         Load $S_{i}$  inM-sized pages and sort pages individuallyelse $d=|S_{i}|$         Load and sort $S_{i}$  as single pageend if    Pick every ${\sqrt {d}}/4$ 'th element from each sorted memory page into contiguous vector $R^{i}$  of samplesend forin parallel do    Combine vectors $R^{1}\dots R^{P}$  into a single contiguous vector ${\mathcal {R}}$     Make ${\sqrt {d}}$  copies of ${\mathcal {R}}$ : ${\mathcal {R}}_{1}\dots {\mathcal {R}}_{\sqrt {d}}$ end do// Find ${\sqrt {d}}$  pivots ${\mathcal {M}}[j]$ for $j=1$  to ${\sqrt {d}}$ in parallel do ${\mathcal {M}}[j]={\texttt {PEMSELECT}}({\mathcal {R}}_{i},{\tfrac {P}{\sqrt {d}}},{\tfrac {j\cdot 4N}{d}})$ end forPack pivots in contiguous array ${\mathcal {M}}$ // PartitionAaround pivots into buckets ${\mathcal {B}}$  ${\mathcal {B}}={\texttt {PEMMULTIPARTITION}}(A[1:N],{\mathcal {M}},{\sqrt {d}},P)$ // Recursively sort bucketsfor $j=1$  to ${\sqrt {d}}+1$ in parallel do    recursively call ${\texttt {PEMDISTSORT}}$  on bucketjof size ${\mathcal {B}}[j]$     using $O\left(\left\lceil {\tfrac {{\mathcal {B}}[j]}{N/P}}\right\rceil \right)$  processors responsible for elements in bucketjend for

The I/O complexity ofPEMDISTSORT is:

O\left(\left\lceil {\frac {N}{PB}}\right\rceil \left(\log _{d}P+\log _{M/B}{\frac {N}{PB}}\right)+f(N,P,d)\cdot \log _{d}P\right)

where

f(N,P,d)=O\left(\log {\frac {PB}{\sqrt {d}}}\log {\frac {N}{P}}+\left\lceil {\frac {\sqrt {d}}{B}}\log P+{\sqrt {d}}\log B\right\rceil \right)

If the number of processors is chosen that $f(N,P,d)=O\left(\left\lceil {\tfrac {N}{PB}}\right\rceil \right)$ and $M<B^{O(1)}$ the I/O complexity is then:

$O\left({\frac {N}{PB}}\log _{M/B}{\frac {N}{B}}\right)$

Other PEM algorithms

[edit]


PEM Algorithm	I/O complexity	Constraints
Mergesort^[1]	$O\left({\frac {N}{PB}}\log _{\frac {M}{B}}{\frac {N}{B}}\right)={\textrm {sort}}_{P}(N)$	$P\leq {\frac {N}{B^{2}}},M=B^{O(1)}$
List ranking^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N/B^{2}}{\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$
Euler tour^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N}{B^{2}}},M=B^{O(1)}$
Expression tree evaluation^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$
Finding aMST^[2]	$O\left({\textrm {sort}}_{P}(\|V\|)+{\textrm {sort}}_{P}(\|E\|)\log {\tfrac {\|V\|}{pB}}\right)$	$p\leq {\frac {\|V\|+\|E\|}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$

Where ${\textrm {sort}}_{P}(N)$ is the time it takes to sortN items withP processors in the PEM model.

References

[edit]

^^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^lArge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors".Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206.doi:10.1145/1378533.1378573.ISBN 9781595939739.S2CID 11067041.
^^a ^b ^c ^dArge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms".2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. pp. 1–11.doi:10.1109/ipdps.2010.5470440.ISBN 9781424464425.S2CID 587572.

v t e Parallel computing
General	Distributed computing Parallel computing Parallel algorithm Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing