Movatterモバイル変換


[0]ホーム

URL:


Jump to content
WikipediaThe Free Encyclopedia
Search

Parallel external memory

From Wikipedia, the free encyclopedia
PEM Model

In computer science, aparallel external memory (PEM) model is acache-aware, external-memoryabstract machine.[1] It is the parallel-computing analogy to the single-processorexternal memory (EM) model. In a similar way, it is the cache-aware analogy to theparallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

Model

[edit]

Definition

[edit]

The PEM model[1] is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists ofP{\displaystyle P} processors and a two-levelmemory hierarchy. This memory hierarchy consists of a large external memory (main memory) of sizeN{\displaystyle N} andP{\displaystyle P} small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a sizeM{\displaystyle M} which is partitioned in blocks of sizeB{\displaystyle B}. The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of sizeB{\displaystyle B}.

I/O complexity

[edit]

The complexity measure of the PEM model is the I/O complexity,[1] which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So ifP{\displaystyle P} processors load parallelly a data block of sizeB{\displaystyle B} form the main memory into their caches, it is considered as an I/O complexity ofO(1){\displaystyle O(1)} notO(P){\displaystyle O(P)}. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.

Read/write conflicts

[edit]

In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts[1] occur. Like in the PRAM model, three different variations of this problem are considered:

  • Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
  • Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
  • Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

The following two algorithms[1] solve the CREW and EREW problem ifPB{\displaystyle P\leq B} processors write to the same block simultaneously.A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total ofP{\displaystyle P} parallel block transfers. A second approach needsO(log(P)){\displaystyle O(\log(P))} parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first roundP{\displaystyle P} processors combine their blocks intoP/2{\displaystyle P/2} blocks. ThenP/2{\displaystyle P/2} processors combine theP/2{\displaystyle P/2} blocks intoP/4{\displaystyle P/4}. This procedure is continued until all the data is combined in one block.

Comparison to other models

[edit]
ModelMulti-coreCache-aware
Random-access machine (RAM)NoNo
Parallel random-access machine (PRAM)YesNo
External memory (EM)NoYes
Parallel external memory (PEM)YesYes

Examples

[edit]

Multiway partitioning

[edit]

LetM={m1,...,md1}{\displaystyle M=\{m_{1},...,m_{d-1}\}} be a vector of d-1 pivots sorted in increasing order. LetA be an unordered set of N elements. A d-way partition[1] ofA is a setΠ={A1,...,Ad}{\displaystyle \Pi =\{A_{1},...,A_{d}\}} , wherei=1dAi=A{\displaystyle \cup _{i=1}^{d}A_{i}=A} andAiAj={\displaystyle A_{i}\cap A_{j}=\emptyset } for1i<jd{\displaystyle 1\leq i<j\leq d}.Ai{\displaystyle A_{i}} is called the i-th bucket. The number of elements inAi{\displaystyle A_{i}} is greater thanmi1{\displaystyle m_{i-1}} and smaller thanmi2{\displaystyle m_{i}^{2}}. In the following algorithm[1] the input is partitioned into N/P-sized contiguous segmentsS1,...,SP{\displaystyle S_{1},...,S_{P}} in main memory. The processor i primarily works on the segmentSi{\displaystyle S_{i}}. The multiway partitioning algorithm (PEM_DIST_SORT[1]) uses a PEMprefix sum algorithm[1] to calculate the prefix sum with the optimalO(NPB+logP){\displaystyle O\left({\frac {N}{PB}}+\log P\right)} I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.

// Compute parallelly a d-way partition on the data segmentsSi{\displaystyle S_{i}}for each processor iin parallel do    Read the vector of pivotsM into the cache.    PartitionSi{\displaystyle S_{i}} into d buckets and let vectorMi={j1i,...,jdi}{\displaystyle M_{i}=\{j_{1}^{i},...,j_{d}^{i}\}} be the number of items in each bucket.end forRun PEM prefix sum on the set of vectors{M1,...,MP}{\displaystyle \{M_{1},...,M_{P}\}} simultaneously.// Use the prefix sum vector to compute the final partitionfor each processor iin parallel do    Write elementsSi{\displaystyle S_{i}} into memory locations offset appropriately byMi1{\displaystyle M_{i-1}} andMi{\displaystyle M_{i}}.end forUsing the prefix sums stored inMP{\displaystyle M_{P}} the last processor P calculates the vectorB of bucket sizes and returns it.

If the vector ofd=O(MB){\displaystyle d=O\left({\frac {M}{B}}\right)} pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model withO(NPB+dB>log(P)+dlog(B)){\displaystyle O\left({\frac {N}{PB}}+\left\lceil {\frac {d}{B}}\right\rceil >\log(P)+d\log(B)\right)} I/O complexity. The content of the final buckets have to be located in contiguous memory.

Selection

[edit]

Theselection problem is about finding the k-th smallest item in an unordered listA of sizeN.The following code[1] makes use ofPRAMSORT which is a PRAM optimal sorting algorithm which runs inO(logN){\displaystyle O(\log N)}, andSELECT, which is a cache optimal single-processor selection algorithm.

ifNP{\displaystyle N\leq P}thenPRAMSORT(A,P){\displaystyle {\texttt {PRAMSORT}}(A,P)}returnA[k]{\displaystyle A[k]}end if //Find median of eachSi{\displaystyle S_{i}}for each processoriin parallel domi=SELECT(Si,N2P){\displaystyle m_{i}={\texttt {SELECT}}(S_{i},{\frac {N}{2P}})}end for // Sort mediansPRAMSORT({m1,,m2},P){\displaystyle {\texttt {PRAMSORT}}(\lbrace m_{1},\dots ,m_{2}\rbrace ,P)}// Partition around median of medianst=PEMPARTITION(A,mP/2,P){\displaystyle t={\texttt {PEMPARTITION}}(A,m_{P/2},P)}ifkt{\displaystyle k\leq t}thenreturnPEMSELECT(A[1:t],P,k){\displaystyle {\texttt {PEMSELECT}}(A[1:t],P,k)}elsereturnPEMSELECT(A[t+1:N],P,kt){\displaystyle {\texttt {PEMSELECT}}(A[t+1:N],P,k-t)}end if

Under the assumption that the input is stored in contiguous memory,PEMSELECT has an I/O complexity of:

O(NPB+log(PB)log(NP)){\displaystyle O\left({\frac {N}{PB}}+\log(PB)\cdot \log({\frac {N}{P}})\right)}

Distribution sort

[edit]

Distribution sort partitions an input listA of sizeN intod disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

IfP=1{\displaystyle P=1} the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm[1] is used:

// Sample4Nd{\displaystyle {\tfrac {4N}{\sqrt {d}}}} elements fromAforeach processoriin parallel doifM<|Si|{\displaystyle M<|S_{i}|}thend=M/B{\displaystyle d=M/B}        LoadSi{\displaystyle S_{i}} inM-sized pages and sort pages individuallyelsed=|Si|{\displaystyle d=|S_{i}|}        Load and sortSi{\displaystyle S_{i}} as single pageend if    Pick everyd/4{\displaystyle {\sqrt {d}}/4}'th element from each sorted memory page into contiguous vectorRi{\displaystyle R^{i}} of samplesend forin parallel do    Combine vectorsR1RP{\displaystyle R^{1}\dots R^{P}} into a single contiguous vectorR{\displaystyle {\mathcal {R}}}    Maked{\displaystyle {\sqrt {d}}} copies ofR{\displaystyle {\mathcal {R}}}:R1Rd{\displaystyle {\mathcal {R}}_{1}\dots {\mathcal {R}}_{\sqrt {d}}}end do// Findd{\displaystyle {\sqrt {d}}} pivotsM[j]{\displaystyle {\mathcal {M}}[j]}forj=1{\displaystyle j=1} tod{\displaystyle {\sqrt {d}}}in parallel doM[j]=PEMSELECT(Ri,Pd,j4Nd){\displaystyle {\mathcal {M}}[j]={\texttt {PEMSELECT}}({\mathcal {R}}_{i},{\tfrac {P}{\sqrt {d}}},{\tfrac {j\cdot 4N}{d}})}end forPack pivots in contiguous arrayM{\displaystyle {\mathcal {M}}}// PartitionAaround pivots into bucketsB{\displaystyle {\mathcal {B}}}B=PEMMULTIPARTITION(A[1:N],M,d,P){\displaystyle {\mathcal {B}}={\texttt {PEMMULTIPARTITION}}(A[1:N],{\mathcal {M}},{\sqrt {d}},P)}// Recursively sort bucketsforj=1{\displaystyle j=1} tod+1{\displaystyle {\sqrt {d}}+1}in parallel do    recursively callPEMDISTSORT{\displaystyle {\texttt {PEMDISTSORT}}} on bucketjof sizeB[j]{\displaystyle {\mathcal {B}}[j]}    usingO(B[j]N/P){\displaystyle O\left(\left\lceil {\tfrac {{\mathcal {B}}[j]}{N/P}}\right\rceil \right)} processors responsible for elements in bucketjend for

The I/O complexity ofPEMDISTSORT is:

O(NPB(logdP+logM/BNPB)+f(N,P,d)logdP){\displaystyle O\left(\left\lceil {\frac {N}{PB}}\right\rceil \left(\log _{d}P+\log _{M/B}{\frac {N}{PB}}\right)+f(N,P,d)\cdot \log _{d}P\right)}

where

f(N,P,d)=O(logPBdlogNP+dBlogP+dlogB){\displaystyle f(N,P,d)=O\left(\log {\frac {PB}{\sqrt {d}}}\log {\frac {N}{P}}+\left\lceil {\frac {\sqrt {d}}{B}}\log P+{\sqrt {d}}\log B\right\rceil \right)}

If the number of processors is chosen thatf(N,P,d)=O(NPB){\displaystyle f(N,P,d)=O\left(\left\lceil {\tfrac {N}{PB}}\right\rceil \right)}andM<BO(1){\displaystyle M<B^{O(1)}} the I/O complexity is then:

O(NPBlogM/BNB){\displaystyle O\left({\frac {N}{PB}}\log _{M/B}{\frac {N}{B}}\right)}

Other PEM algorithms

[edit]
PEM AlgorithmI/O complexityConstraints
Mergesort[1]O(NPBlogMBNB)=sortP(N){\displaystyle O\left({\frac {N}{PB}}\log _{\frac {M}{B}}{\frac {N}{B}}\right)={\textrm {sort}}_{P}(N)}PNB2,M=BO(1){\displaystyle P\leq {\frac {N}{B^{2}}},M=B^{O(1)}}
List ranking[2]O(sortP(N)){\displaystyle O\left({\textrm {sort}}_{P}(N)\right)}PN/B2logBlogO(1)N,M=BO(1){\displaystyle P\leq {\frac {N/B^{2}}{\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}
Euler tour[2]O(sortP(N)){\displaystyle O\left({\textrm {sort}}_{P}(N)\right)}PNB2,M=BO(1){\displaystyle P\leq {\frac {N}{B^{2}}},M=B^{O(1)}}
Expression tree evaluation[2]O(sortP(N)){\displaystyle O\left({\textrm {sort}}_{P}(N)\right)}PNB2logBlogO(1)N,M=BO(1){\displaystyle P\leq {\frac {N}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}
Finding aMST[2]O(sortP(|V|)+sortP(|E|)log|V|pB){\displaystyle O\left({\textrm {sort}}_{P}(|V|)+{\textrm {sort}}_{P}(|E|)\log {\tfrac {|V|}{pB}}\right)}p|V|+|E|B2logBlogO(1)N,M=BO(1){\displaystyle p\leq {\frac {|V|+|E|}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}}

WheresortP(N){\displaystyle {\textrm {sort}}_{P}(N)} is the time it takes to sortN items withP processors in the PEM model.

See also

[edit]

References

[edit]
  1. ^abcdefghijklArge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors".Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206.doi:10.1145/1378533.1378573.ISBN 9781595939739.S2CID 11067041.
  2. ^abcdArge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms".2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. pp. 1–11.doi:10.1109/ipdps.2010.5470440.ISBN 9781424464425.S2CID 587572.
General
Levels
Multithreading
Theory
Elements
Coordination
Programming
Hardware
APIs
Problems
Retrieved from "https://en.wikipedia.org/w/index.php?title=Parallel_external_memory&oldid=1180384636"
Categories:

[8]ページ先頭

©2009-2025 Movatter.jp