Multiprocessor architecture and programming

Raul Goycoolea Seoane

The document discusses the evolution and challenges of parallel computing, outlining historical software crises and how different programming paradigms emerged to address these challenges. It also describes the principles of parallel computing, including architectures and classifications like Flynn's taxonomy, highlighting the shift from serial to parallel computing. Key concepts include the necessity for performance optimization, the transition to multicore processors, and an understanding of various parallel programming techniques.

Parallel Computing Architecture & Programming Techniques
Raul Goycoolea S.
Solution Architect Manager, Oracle Enterprise Architecture Group
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Antecedents ofParallelComputing
The “Software Crisis”“To put it quite bluntly: as long as there were nomachines, programming was no problem at all; whenwe had a few weak computers, programming became amild problem, and now we have gigantic computers,programming has become an equally gigantic problem."-- E. Dijkstra, 1972 Turing Award LectureRaul Goycoolea S.Multiprocessor Programming 416 February 2012
The First Software Crisis• Time Frame: ’60s and ’70s• Problem: Assembly Language ProgrammingComputers could handle larger more complex programs• Needed to get Abstraction and Portability withoutlosing PerformanceRaul Goycoolea S.Multiprocessor Programming 516 February 2012
How Did We Solve The First Software Crisis?
• High-level languages for von Neumann machines: FORTRAN and C
• Provided a “common machine language” for uniprocessors
Common properties: single flow of control, single memory image
Differences: register file, ISA, functional units
The Second Software Crisis• Time Frame: ’80s and ’90s• Problem: Inability to build and maintain complex androbust applications requiring multi-million lines ofcode developed by hundreds of programmersComputers could handle larger more complex programs• Needed to get Composability, Malleability andMaintainabilityHigh-performance was not an issue left for Moore’s LawRaul Goycoolea S.Multiprocessor Programming 716 February 2012
How Did We Solve the SecondSoftware Crisis?• Object Oriented ProgrammingC++, C# and Java• Also…Better tools• Component libraries, PurifyBetter software engineering methodology• Design patterns, specification, testing, codereviewsRaul Goycoolea S.Multiprocessor Programming 816 February 2012
Today: Programmers are Oblivious to Processors
• Solid boundary between hardware and software
• Programmers don’t have to know anything about the processor
High-level languages abstract away the processors; for example, Java bytecode is machine independent
Moore’s law does not require the programmers to know anything about the processors to get good speedups
• Programs are oblivious of the processor and work on all processors
A program written in the ’70s using C still works, and is much faster, today
• This abstraction provides a lot of freedom for the programmers
The Origins of a Third Crisis• Time Frame: 2005 to 20??• Problem: Sequential performance is left behind byMoore’s law• Needed continuous and reasonable performanceimprovementsto support new featuresto support larger datasets• While sustaining portability, malleability andmaintainability without unduly increasing complexityfaced by the programmer critical to keep-up with thecurrent rate of evolution in softwareRaul Goycoolea S.Multiprocessor Programming 1016 February 2012
The Road to Multicore: Moore’s Law
[Chart: number of transistors (10,000 to 1,000,000,000) and performance relative to the VAX-11/780 for processors from the 8086 through the 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2, 1978 to 2016; performance growth rates of roughly 25% and 52% per year in different eras. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
The Road to Multicore: Uniprocessor Performance (SPECint)
[Chart: SPECint2000 performance, 1985 to 2007, for Intel 386/486/Pentium/Pentium 2/Pentium 3/Pentium 4/Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, AMD K6/K7/x86-64.]
The Road to Multicore: Uniprocessor Performance (SPECint)
General-purpose unicores have stopped historic performance scaling:
• Power consumption
• Wire delays
• DRAM access latency
• Diminishing returns of more instruction-level parallelism
Power Consumption (watts)
[Chart: power consumption in watts, 1985 to 2007, for the same processor families (Intel 386 through Itanium, Alpha, Sparc, MIPS, HP PA, PowerPC, AMD); power rises by roughly two orders of magnitude over the period.]
Power Efficiency (watts/spec)
[Chart: watts per SPEC, 1982 to 2006, for the same processor families; efficiency worsens sharply for the latest unicore generations.]
Range of a Wire in One Clock Cycle
[Chart: process size in microns (0.26 down to 0.06) versus year (1996 to 2014), with clock rates from 700 MHz to 13.5 GHz; assumes a 400 mm2 die. From the SIA Roadmap.]
DRAM Access Latency
[Chart: processor performance improves about 60% per year (2x every 1.5 years) while DRAM improves about 9% per year (2x every 10 years), 1980 to 2004.]
• Access times are a speed-of-light issue
• Memory technology is also changing: SRAM is getting harder to scale; DRAM is no longer the cheapest cost/bit
• Power efficiency is an issue here as well
CPU Architecture: Heat Becoming an Unmanageable Problem
[Chart: power density (W/cm2) from the 4004 through the 8008, 8080, 8086, 286, 386, 486 and Pentium, approaching that of a hot plate and extrapolating toward a nuclear reactor, rocket nozzle and the Sun’s surface. Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W).]
Cube relationship between the cycle time and power.
Diminishing Returns• The ’80s: Superscalar expansion50% per year improvement in performanceTransistors applied to implicit parallelism- pipeline processor (10 CPI --> 1 CPI)• The ’90s: The Era of Diminishing ReturnsSqueaking out the last implicit parallelism2-way to 6-way issue, out-of-order issue, branch prediction1 CPI --> 0.5 CPIPerformance below expectations projects delayed & canceled• The ’00s: The Beginning of the Multicore EraThe need for Explicit ParallelismRaul Goycoolea S.Multiprocessor Programming 1916 February 2012
Unicores are going extinct: now everything is multicore
[Timeline, 2H 2004 to 2H 2006: MIT Raw (16 cores, 2002); IBM Power 4 and 5 (dual cores since 2001); Intel Pentium D (Smithfield); cancelled Intel Tejas & Jayhawk (unicore 4 GHz P4); Intel Pentium Extreme (3.2 GHz dual core); AMD Opteron (dual core); Intel Yonah (dual-core mobile); Intel Dempsey (dual-core Xeon); Intel Montecito (dual-core IA-64, 1.7 billion transistors); Intel Tanglewood (dual-core IA-64); IBM Power 6 (dual core); Sun Olympus and Niagara (8 processor cores); IBM Cell (scalable multicore).]
Multicores Future
[Chart: number of cores per chip versus year, 1970 to 2010, from unicores (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PA-8800, Opteron, PExtreme, Power6, Yonah, Tanglewood) to multicores with 2 to 512 cores (Opteron 4P, Xeon MP, Cell, Raw, Niagara, Xbox360, Broadcom 1480, Cisco CSR-1, Picochip PC102, Cavium Octeon, Raza XLR, Intel Tflops, Ambric AM2045).]
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 2216 February 2012
Introduction toParallelArchitectures
Traditionally, software has been written for serial computation:• To be run on a single computer having a single Central Processing Unit (CPU)• A problem is broken into a discrete series of instructions• Instructions are executed one after another• Only one instruction may execute at any moment in timeWhat is Parallel Computing?Raul Goycoolea S.Multiprocessor Programming 2416 February 2012
What is Parallel Computing?In the simplest sense, parallel computing is the simultaneous use of multiplecompute resources to solve a computational problem:• To be run using multiple CPUs• A problem is broken into discrete parts that can be solved concurrently• Each part is further broken down to a series of instructions• Instructions from each part execute simultaneously on different CPUsRaul Goycoolea S.Multiprocessor Programming 2516 February 2012
Options in Parallel Computing?The compute resources might be:• A single computer with multiple processors;• An arbitrary number of computers connected by a network;• A combination of both.The computational problem should be able to:• Be broken apart into discrete pieces of work that can be solvedsimultaneously;• Execute multiple program instructions at any moment in time;• Be solved in less time with multiple compute resources than with asingle compute resource.Raul Goycoolea S.Multiprocessor Programming 2616 February 2012
The Real World is Massively Parallel• Parallel computing is an evolution of serial computing thatattempts to emulate what has always been the state ofaffairs in the natural world: many complex, interrelatedevents happening at the same time, yet within a sequence.For example:• Galaxy formation• Planetary movement• Weather and ocean patterns• Tectonic plate drift Rush hour traffic• Automobile assembly line• Building a jet• Ordering a hamburger at the drive through.Raul Goycoolea S.Multiprocessor Programming 2816 February 2012
Architecture Concepts: Von Neumann Architecture
• Named after the Hungarian mathematician John von Neumann, who first authored the general requirements for an electronic computer in his 1945 papers
• Since then, virtually all computers have followed this basic design, differing from earlier computers which were programmed through "hard wiring"
• Comprised of four main components: Memory, Control Unit, Arithmetic Logic Unit, Input/Output
• Read/write, random-access memory is used to store both program instructions and data
• Program instructions are coded data which tell the computer to do something
• Data is simply information to be used by the program
• The control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task
• The arithmetic logic unit performs basic arithmetic operations
• Input/Output is the interface to the human operator
Flynn’s Taxonomy• There are different ways to classify parallel computers. One of the morewidely used classifications, in use since 1966, is called Flynn'sTaxonomy.• Flynn's taxonomy distinguishes multi-processor computer architecturesaccording to how they can be classified along the two independentdimensions of Instruction and Data. Each of these dimensions canhave only one of two possible states: Single or Multiple.• The matrix below defines the 4 possible classifications according toFlynn:Raul Goycoolea S.Multiprocessor Programming 3016 February 2012
Single Instruction, Single Data (SISD):• A serial (non-parallel) computer• Single Instruction: Only one instruction stream isbeing acted on by the CPU during any one clockcycle• Single Data: Only one data stream is being usedas input during any one clock cycle• Deterministic execution• This is the oldest and even today, the mostcommon type of computer• Examples: older generation mainframes,minicomputers and workstations; most modernday PCs.Raul Goycoolea S.Multiprocessor Programming 3116 February 2012
Single Instruction, Single Data (SISD):Raul Goycoolea S.Multiprocessor Programming 3216 February 2012
Single Instruction, Multiple Data(SIMD):• A type of parallel computer• Single Instruction: All processing units execute the same instruction at anygiven clock cycle• Multiple Data: Each processing unit can operate on a different data element• Best suited for specialized problems characterized by a high degree ofregularity, such as graphics/image processing.• Synchronous (lockstep) and deterministic execution• Two varieties: Processor Arrays and Vector Pipelines• Examples:• Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,ETA10• Most modern computers, particularly those with graphics processor units(GPUs) employ SIMD instructions and execution units.Raul Goycoolea S.Multiprocessor Programming 3316 February 2012
Single Instruction, Multiple Data(SIMD):ILLIAC IV MasPar TM CM-2 Cell GPUCray X-MP Cray Y-MPRaul Goycoolea S.Multiprocessor Programming 3416 February 2012
• A type of parallel computer• Multiple Instruction: Each processing unit operates on the dataindependently via separate instruction streams.• Single Data: A single data stream is fed into multiple processingunits.• Few actual examples of this class of parallel computer have everexisted. One is the experimental Carnegie-Mellon C.mmp computer(1971).• Some conceivable uses might be:• multiple frequency filters operating on a single signal stream• multiple cryptography algorithms attempting to crack a single codedmessage.Multiple Instruction, Single Data(MISD):Raul Goycoolea S.Multiprocessor Programming 3516 February 2012
Multiple Instruction, Single Data(MISD):Raul Goycoolea S.Multiprocessor Programming 3616 February 2012
• A type of parallel computer• Multiple Instruction: Every processor may be executing a differentinstruction stream• Multiple Data: Every processor may be working with a differentdata stream• Execution can be synchronous or asynchronous, deterministic ornon-deterministic• Currently, the most common type of parallel computer - mostmodern supercomputers fall into this category.• Examples: most current supercomputers, networked parallelcomputer clusters and "grids", multi-processor SMP computers,multi-core PCs.Note: many MIMD architectures also include SIMD execution sub-componentsMultiple Instruction, Multiple Data(MIMD):Raul Goycoolea S.Multiprocessor Programming 3716 February 2012
Multiple Instruction, Multiple Data(MIMD):Raul Goycoolea S.Multiprocessor Programming 3816 February 2012
Multiple Instruction, Multiple Data(MIMD):IBM Power HP Alphaserver Intel IA32/x64Oracle SPARC Cray XT3 Oracle Exadata/ExalogicRaul Goycoolea S.Multiprocessor Programming 3916 February 2012
Parallel Computer Memory ArchitectureShared MemoryShared memory parallel computers vary widely, but generally have in common theability for all processors to access all memory as global address space.Multiple processors can operate independently but share the same memoryresources.Changes in a memory location effected by one processor are visible to all otherprocessors.Shared memory machines can be divided into two main classes based uponmemory access times: UMA and NUMA.Uniform Memory Access (UMA):• Most commonly represented today by Symmetric Multiprocessor (SMP) machines• Identical processorsNon-Uniform Memory Access (NUMA):• Often made by physically linking two or more SMPs• One SMP can directly access memory of another SMP40Raul Goycoolea S.Multiprocessor Programming 4016 February 2012
Parallel Computer Memory ArchitectureShared Memory41Shared Memory (UMA) Shared Memory (NUMA)Raul Goycoolea S.Multiprocessor Programming 4116 February 2012
Basic structure of a centralized shared-memory multiprocessor
[Diagram: four processors, each with one or more levels of cache, sharing one physical memory.]
Multiple processor-cache subsystems share the same physical memory, typically connected by a bus. In larger designs, multiple buses, or even a switch, may be used, but the key architectural property remains: uniform access time to all memory from all processors.
Processor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryInterconnection NetworkBasic Architecture of a DistributedMultiprocessorConsists of individual nodes containing a processor, some memory, typically some I/O, and an interface to aninterconnection network that connects all the nodes. Individual nodes may contain a small number ofprocessors, which may be interconnected by a small bus or a different interconnection technology, which is lessscalable than the global interconnection network.Raul Goycoolea S.Multiprocessor Programming 4316 February 2012
Communicationhow do parallel operations communicate data results?Synchronizationhow are parallel operations coordinated?Resource Managementhow are a large number of parallel tasks scheduled ontofinite hardware?Scalabilityhow large a machine can be built?Issues in Parallel Machine DesignRaul Goycoolea S.Multiprocessor Programming 4416 February 2012
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 4516 February 2012
ParallelProgrammingConcepts
ExplicitImplicitHardware CompilerSuperscalarProcessorsExplicitly Parallel ArchitecturesImplicit vs. Explicit ParallelismRaul Goycoolea S.Multiprocessor Programming 4716 February 2012
Implicit Parallelism: Superscalar ProcessorsExplicit ParallelismShared Instruction ProcessorsShared Sequencer ProcessorsShared Network ProcessorsShared Memory ProcessorsMulticore ProcessorsOutlineRaul Goycoolea S.Multiprocessor Programming 4816 February 2012
Issue varying numbers of instructions per clockstatically scheduled––using compiler techniquesin-order executiondynamically scheduled–––––Extracting ILP by examining 100‟s of instructionsScheduling them in parallel as operands become availableRename registers to eliminate anti dependencesout-of-order executionSpeculative executionImplicit Parallelism: SuperscalarProcessorsRaul Goycoolea S.Multiprocessor Programming 4916 February 2012
Instruction i IF ID EX WBIF ID EX WBIF ID EX WBIF ID EX WBIF ID EX WBInstruction i+1Instruction i+2Instruction i+3Instruction i+4Instruction # 1 2 3 4 5 6 7 8IF: Instruction fetchEX : ExecutionCyclesID : Instruction decodeWB : Write backPipelining ExecutionRaul Goycoolea S.Multiprocessor Programming 5016 February 2012
Instruction type 1 2 3 4 5 6 7CyclesIntegerFloating pointIFIFIDIDEXEXWBWBIntegerFloating pointIntegerFloating pointIntegerFloating pointIFIFIDIDEXEXWBWBIFIFIDIDEXEXWBWBIFIFIDIDEXEXWBWB2-issue super-scalar machineSuper-Scalar ExecutionRaul Goycoolea S.Multiprocessor Programming 5116 February 2012
Intrinsic data dependent (aka true dependence) on Instructions:I: add r1,r2,r3J: sub r4,r1,r3If two instructions are data dependent, they cannot executesimultaneously, be completely overlapped or execute in out-of-orderIf data dependence caused a hazard in pipeline,called a Read After Write (RAW) hazardData Dependence and HazardsRaul Goycoolea S.Multiprocessor Programming 5216 February 2012
HW/SW must preserve program order:order instructions would execute in if executed sequentially asdetermined by original source programDependences are a property of programsImportance of the data dependencies1) indicates the possibility of a hazard2) determines order in which results must be calculated3) sets an upper bound on how much parallelism can possiblybe exploitedGoal: exploit parallelism by preserving program order onlywhere it affects the outcome of the programILP and Data Dependencies, HazardsRaul Goycoolea S.Multiprocessor Programming 5316 February 2012
Name Dependence #1: Anti-dependence
Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; there are 2 versions of name dependence.
InstrJ writes an operand before InstrI reads it:
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
Name Dependence #2: Output Dependence
InstrJ writes an operand before InstrI writes it:
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict.
Register renaming resolves name dependences for registers; renaming can be done either by the compiler or by hardware.
Every instruction is control dependent on some set ofbranches, and, in general, these control dependencies mustbe preserved to preserve program orderif p1 {S1;};if p2 {S2;}S1 is control dependent on p1, and S2 is control dependenton p2 but not on p1.Control dependence need not be preservedwilling to execute instructions that should not have beenexecuted, thereby violating the control dependences, if cando so without affecting correctness of the programSpeculative ExecutionControl DependenciesRaul Goycoolea S.Multiprocessor Programming 5616 February 2012
Greater ILP: Overcome control dependence by hardwarespeculating on outcome of branches and executingprogram as if guesses were correctSpeculation ⇒ fetch, issue, and executeinstructions as if branch predictions were alwayscorrectDynamic scheduling ⇒ only fetches and issuesinstructionsEssentially a data flow execution model: Operationsexecute as soon as their operands are availableSpeculationRaul Goycoolea S.Multiprocessor Programming 5716 February 2012
Speculation is Rampant in Modern Superscalars
Different predictors: branch prediction, value prediction, prefetching (memory access pattern prediction)
Inefficient: predictions can go wrong, and wrongly predicted work has to be flushed out; even when it does not impact performance, it consumes power
Implicit Parallelism: Superscalar ProcessorsExplicit ParallelismShared Instruction ProcessorsShared Sequencer ProcessorsShared Network ProcessorsShared Memory ProcessorsMulticore ProcessorsOutlineRaul Goycoolea S.Multiprocessor Programming 5916 February 2012
Parallelism is exposed to softwareCompiler or ProgrammerMany different formsLoosely coupled Multiprocessors to tightly coupled VLIWExplicit Parallel ProcessorsRaul Goycoolea S.Multiprocessor Programming 6016 February 2012
Little’s Law

Parallelism = Throughput * Latency

To maintain a throughput of T operations per cycle when each operation has a latency of L cycles, you need T*L independent operations in flight.
For fixed parallelism: decreased latency allows increased throughput; decreased throughput allows increased latency tolerance.
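A quick worked example (my numbers, for illustration only): to sustain 4 floating-point operations per cycle on a unit with a 5-cycle latency, the hardware or the compiler must keep 4 * 5 = 20 independent operations in flight; if the latency grows to 10 cycles at the same throughput, 40 independent operations are needed.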
TimeTimeTimeTimeData-Level Parallelism (DLP)Instruction-Level Parallelism (ILP)PipeliningThread-Level Parallelism (TLP)Types of Software ParallelismRaul Goycoolea S.Multiprocessor Programming 6216 February 2012
PipeliningThreadParallelDataParallelInstructionParallelTranslating Parallelism TypesRaul Goycoolea S.Multiprocessor Programming 6316 February 2012
What is a sequential program?A single thread of control that executes one instruction and when it isfinished execute the next logical instructionWhat is a concurrent program?A collection of autonomous sequential threads, executing (logically) inparallelThe implementation (i.e. execution) of a collection of threads can be:Multiprogramming– Threads multiplex their executions on a single processor.Multiprocessing– Threads multiplex their executions on a multiprocessor or a multicore systemDistributed Processing– Processes multiplex their executions on several different machinesWhat is concurrency?Raul Goycoolea S.Multiprocessor Programming 6416 February 2012
Concurrency is not (only) parallelismInterleaved ConcurrencyLogically simultaneous processingInterleaved execution on a singleprocessorParallelismPhysically simultaneous processingRequires a multiprocessors or amulticore systemTimeTimeABCABCConcurrency and ParallelismRaul Goycoolea S.Multiprocessor Programming 6516 February 2012
There are a lot of ways to use Concurrency inProgrammingSemaphoresBlocking & non-blocking queuesConcurrent hash mapsCopy-on-write arraysExchangersBarriersFuturesThread pool supportOther Types of SynchronizationRaul Goycoolea S.Multiprocessor Programming 6616 February 2012
DeadlockTwo or more threads stop and wait for each otherLivelockTwo or more threads continue to execute, but makeno progress toward the ultimate goalStarvationSome thread gets deferred foreverLack of fairnessEach thread gets a turn to make progressRace ConditionSome possible interleaving of threads results in anundesired computation resultPotential Concurrency ProblemsRaul Goycoolea S.Multiprocessor Programming 6716 February 2012
Concurrency and Parallelism are important conceptsin Computer ScienceConcurrency can simplify programmingHowever it can be very hard to understand and debugconcurrent programsParallelism is critical for high performanceFrom Supercomputers in national labs toMulticores and GPUs on your desktopConcurrency is the basis for writing parallel programsNext Lecture: How to write a Parallel ProgramParallelism ConclusionsRaul Goycoolea S.Multiprocessor Programming 6816 February 2012
Shared memory––––Ex: Intel Core 2 Duo/QuadOne copy of data sharedamong many coresAtomicity, locking andsynchronizationessential for correctnessMany scalability issuesDistributed memory––––Ex: CellCores primarily access localmemoryExplicit data exchangebetween coresData distribution andcommunication orchestrationis essential for performanceP1 P2 P3 PnMemoryInterconnection NetworkInterconnection NetworkP1 P2 P3 PnM1 M2 M3 MnTwo primary patterns of multicore architecture designArchitecture RecapRaul Goycoolea S.Multiprocessor Programming 6916 February 2012
Programming Shared Memory Processors
Processors 1…n ask for X; there is only one place to look.
Communication is through shared variables; race conditions are possible.
Use synchronization to protect from conflicts; change how data is stored to minimize synchronization.
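As a minimal illustration of protecting a shared variable (my sketch, not from the slides), a pthreads mutex serializes the conflicting updates so no increment is lost to a race:

#include <pthread.h>

long x = 0;                                    /* shared variable */
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds to the shared counter; without the lock two threads
   could read the same old value and one of the updates would be lost. */
void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&x_lock);
        x += 1;
        pthread_mutex_unlock(&x_lock);
    }
    return NULL;
}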
Example of Parallelization
Data parallel: perform the same computation but operate on different data.
A single process can fork multiple concurrent threads; each thread encapsulates its own execution path; each thread has local state and shared resources; threads communicate through shared resources such as global memory.

for (i = 0; i < 12; i++)
  C[i] = A[i] + B[i];

[Diagram: fork (threads), iterations i = 0..11 distributed across the threads, join (barrier).]
Example Parallelization with Threads

#include <pthread.h>
#include <stdint.h>

int A[12] = {...};
int B[12] = {...};
int C[12];

/* Each thread adds one block of 4 elements, starting at the index
   passed in through the thread argument. */
void *add_arrays(void *arg)
{
  int start = (int)(intptr_t)arg;
  int i;
  for (i = start; i < start + 4; i++)
    C[i] = A[i] + B[i];
  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t thread_ids[3];
  int rc, t;
  for (t = 0; t < 3; t++) {
    rc = pthread_create(&thread_ids[t],
                        NULL /* attributes */,
                        add_arrays /* function */,
                        (void *)(intptr_t)(t * 4) /* start index for this thread */);
  }
  pthread_exit(NULL);  /* main exits; the process lives until the worker threads finish */
}

[Diagram: fork (threads), iterations i = 0..11 split into blocks of four, join (barrier).]
Data parallelismPerform same computationbut operate on different dataControl parallelismPerform different functionsfork (threads)join (barrier)pthread_create(/* thread id */,/* attributes */,/*/*any functionargs to function*/,*/);Types of ParallelismRaul Goycoolea S.Multiprocessor Programming 7316 February 2012
Uniform Memory Access (UMA)Centrally located memoryAll processors are equidistant (access times)Non-Uniform Access (NUMA)Physically partitioned but accessible by allProcessors have the same address spacePlacement of data affects performanceMemory Access Latency in SharedMemory ArchitecturesRaul Goycoolea S.Multiprocessor Programming 7416 February 2012
Coverage or extent of parallelism in algorithmGranularity of data partitioning among processorsLocality of computation and communication… so how do I parallelize my program?Summary of Parallel PerformanceFactorsRaul Goycoolea S.Multiprocessor Programming 7516 February 2012
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 7616 February 2012
ParallelDesignPatterns
Common Steps to Create a Parallel Program
[Diagram: a sequential computation is partitioned into tasks (decomposition), tasks are assigned to processes (assignment), processes are coordinated (orchestration), and processes are mapped onto processors (mapping).]
Decomposition (Amdahl’s Law)
Identify concurrency and decide at what level to exploit it.
Break up the computation into tasks to be divided among processes:
• Tasks may become available dynamically
• The number of tasks may vary with time
Provide enough tasks to keep processors busy; the number of tasks available at a time is an upper bound on the achievable speedup.
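For reference (the slide names Amdahl’s Law without stating it): if a fraction p of the execution time can be parallelized over n processors, the speedup is bounded by

speedup = 1 / ((1 - p) + p/n)

so, for example, with p = 0.9 and n = 16 the speedup is at most 1 / (0.1 + 0.9/16), roughly 6.4, no matter how many tasks are created.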
Specify mechanism to divide work among coreBalance work and reduce communicationStructured approaches usually work wellCode inspection or understanding of applicationWell-known design patternsAs programmers, we worry about partitioning firstIndependent of architecture or programming modelBut complexity often affect decisions!Granularity
Computation and communication concurrencyPreserve locality of dataSchedule tasks to satisfy dependences earlyOrchestration and Mapping
Provides a cookbook to systematically guide programmersDecompose, Assign, Orchestrate, MapCan lead to high quality solutions in some domainsProvide common vocabulary to the programming communityEach pattern has a name, providing a vocabulary fordiscussing solutionsHelps with software reusability, malleability, and modularityWritten in prescribed format to allow the reader toquickly understand the solution and its contextOtherwise, too difficult for programmers, and software will notfully exploit parallel hardwareParallel Programming by Pattern
Berkeley architecture professorChristopher AlexanderIn 1977, patterns for cityplanning, landscaping, andarchitecture in an attempt tocapture principles for “living”designHistory
Example 167 (p. 783)
Design Patterns: Elements of Reusable Object-Oriented Software (1995)Gang of Four (GOF): Gamma, Helm, Johnson, VlissidesCatalogue of patternsCreation, structural, behavioralPatterns in Object-OrientedProgramming
Algorithm ExpressionFinding ConcurrencyExpose concurrent tasksAlgorithm StructureMap tasks to processes toexploit parallel architecture4 Design SpacesSoftware ConstructionSupporting StructuresCode and data structuringpatternsImplementation MechanismsLow level mechanisms usedto write parallel programsPatterns for ParallelProgramming. Mattson,Sanders, and Massingill(2005).Patterns for Parallelizing Programs
splitfrequency encodedmacroblocksZigZagIQuantizationIDCTSaturationspatially encoded macroblocksdifferentially codedmotion vectorsMotion Vector DecodeRepeatmotion vectorsMPEG bit streamVLDmacroblocks, motion vectorsMPEG DecoderjoinMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayHere’s my algorithm, Where’s theconcurrency?
Task decompositionIndependent coarse-grainedcomputationInherent to algorithmSequence of statements(instructions) that operatetogether as a groupCorresponds to some logicalpart of programUsually follows from the wayprogrammer thinks about aproblemjoinmotion vectorsspatially encoded macroblocksIDCTSaturationMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayHere’s my algorithm, Where’s theconcurrency?
joinmotion vectorsSaturationspatially encoded macroblocksMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationIDCTMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatTask decompositionParallelism in the applicationData decompositionSame computation is appliedto small data chunks derivedfrom large data setHere’s my algorithm, Where’s theconcurrency?
motion vectorsspatially encoded macroblocksMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationIDCTSaturationjoinMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatTask decompositionParallelism in the applicationData decompositionSame computation many dataPipeline decompositionData assembly linesProducer-consumer chainsHere’s my algorithm, Where’s theconcurrency?
Algorithms start with a good understanding of theproblem being solvedPrograms often naturally decompose into tasksTwo common decompositions are––Function calls andDistinct loop iterationsEasier to start with many tasks and later fuse them,rather than too few tasks and later try to split themGuidelines for Task Decomposition
FlexibilityProgram design should afford flexibility in the number andsize of tasks generated––Tasks should not tied to a specific architectureFixed tasks vs. Parameterized tasksEfficiencyTasks should have enough work to amortize the cost ofcreating and managing themTasks should be sufficiently independent so that managingdependencies doesn‟t become the bottleneckSimplicityThe code has to remain readable and easy to understand,and debugGuidelines for Task Decomposition
Data decomposition is often implied by taskdecompositionProgrammers need to address task and datadecomposition to create a parallel programWhich decomposition to start with?Data decomposition is a good starting point whenMain computation is organized around manipulation of alarge data structureSimilar operations are applied to different parts of thedata structureGuidelines for Data DecompositionRaul Goycoolea S.Multiprocessor Programming 9316 February 2012
Common Data Decompositions
Array data structures: decomposition of arrays along rows, columns, or blocks.
Recursive data structures: for example, decomposition of trees into sub-trees.
[Diagram: the problem is split into subproblems, each subproblem is computed, and the partial results are merged into the solution.]
Raul Goycoolea S. Multiprocessor Programming, 16 February 2012
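A minimal sketch of a row-block decomposition (my illustration, not from the slides): each worker is handed a contiguous range of rows of a row-major matrix, which is the simplest of the array decompositions named above.

/* Process rows [row_begin, row_end) of an n_cols-wide matrix stored
   row-major in a flat array; each worker gets a disjoint row block. */
void process_row_block(double *matrix, int n_cols,
                       int row_begin, int row_end) {
    for (int r = row_begin; r < row_end; r++)
        for (int c = 0; c < n_cols; c++)
            matrix[r * n_cols + c] *= 2.0;      /* placeholder computation */
}

/* Divide n_rows as evenly as possible among n_workers. */
void decompose_rows(double *matrix, int n_rows, int n_cols, int n_workers) {
    int chunk = (n_rows + n_workers - 1) / n_workers;   /* ceiling division */
    for (int w = 0; w < n_workers; w++) {
        int begin = w * chunk;
        int end   = (begin + chunk < n_rows) ? begin + chunk : n_rows;
        if (begin < end)
            process_row_block(matrix, n_cols, begin, end); /* one block per worker */
    }
}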
FlexibilitySize and number of data chunks should support a widerange of executionsEfficiencyData chunks should generate comparable amounts ofwork (for load balancing)SimplicityComplex data compositions can get difficult to manageand debugRaul Goycoolea S.Multiprocessor Programming 9516 February 2012Guidelines for Data Decompositions
Data is flowing through a sequence of stagesAssembly line is a good analogyWhat’s a prime example of pipeline decomposition incomputer architecture?Instruction pipeline in modern CPUsWhat’s an example pipeline you may use in your UNIX shell?Pipes in UNIX: cat foobar.c | grep bar | wcOther examplesSignal processingGraphicsZigZagIQuantizationIDCTSaturationGuidelines for Pipeline DecompositionRaul Goycoolea S.Multiprocessor Programming 9616 February 2012
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 9716 February 2012
Performance &Optimization
Coverage or extent of parallelism in algorithmAmdahl‟s LawGranularity of partitioning among processorsCommunication cost and load balancingLocality of computation and communicationCommunication between processors or betweenprocessors and their memoriesReview: Keys to Parallel Performance
Communication Cost Model

Cost = f * ( o + l + (n/m)/B + t_c - overlap )

where:
f = frequency of messages
o = overhead per message (at both ends)
l = network delay per message
n = total data sent
m = number of messages
B = bandwidth along the path (determined by the network)
t_c = cost induced by contention per message
overlap = amount of latency hidden by concurrency with computation
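A quick worked example with made-up numbers: sending n = 4 MB in m = 100 messages over a B = 1 GB/s link means each message carries about 40 KB, which takes about 40 us on the wire; adding o = 5 us overhead and l = 10 us delay gives roughly 55 us per message before overlap, and with 20 us of latency per message hidden by computation the total cost is about 100 * (55 - 20) = 3,500 us. The model makes the two levers explicit: aggregate messages to reduce m, and overlap communication with computation.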
synchronizationpointGet DataComputeGet DataCPU is idleMemory is idleComputeOverlapping Communication withComputation
Computation to communication ratio limitsperformance gains from pipeliningGet DataComputeGet DataComputeWhere else to look for performance?Limits in Pipelining Communication
Determined by program implementation andinteractions with the architectureExamples:Poor distribution of data across distributed memoriesUnnecessarily fetching data that is not usedRedundant data fetchesArtifactual Communication
In uniprocessors, CPU communicates with memoryLoads and stores are to uniprocessors as_______ and ______ are to distributed memorymultiprocessorsHow is communication overlap enhanced inuniprocessors?Spatial localityTemporal locality“get” “put”Lessons From Uniprocessors
CPU asks for data at address 1000Memory sends data at address 1000 … 1064Amount of data sent depends on architectureparameters such as the cache block sizeWorks well if CPU actually ends up using data from1001, 1002, …, 1064Otherwise wasted bandwidth and cache capacitySpatial Locality
Main memory access is expensiveMemory hierarchy adds small but fast memories(caches) near the CPUMemories get bigger as distancefrom CPU increasesCPU asks for data at address 1000Memory hierarchy anticipates more accesses to sameaddress and stores a local copyWorks well if CPU actually ends up using data from 1000 overand over and over …Otherwise wasted cache capacitymainmemorycache(level 2)cache(level 1)Temporal Locality
Data is transferred in chunks to amortizecommunication costCell: DMA gets up to 16KUsually get a contiguous chunk of memorySpatial localityComputation should exhibit good spatial localitycharacteristicsTemporal localityReorder computation to maximize use of data fetchedReducing Artifactual Costs inDistributed Memory Architectures
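As an illustration of why access order matters (my example, not from the slides): traversing a row-major C array down its columns touches a new cache line on nearly every access, while row order reuses each fetched line.

#define N 1024
double grid[N][N];

/* Row-order traversal: consecutive accesses fall in the same cache line
   (good spatial locality). */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += grid[i][j];
    return s;
}

/* Column-order traversal: each access jumps N * sizeof(double) bytes,
   so most accesses miss (poor spatial locality). */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += grid[i][j];
    return s;
}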
Tasks mapped to execution units (threads)Threads run on individual processors (cores)finish line: sequential time + longest parallel timeTwo keys to faster executionLoad balance the work among the processorsMake execution on each processor fastersequentialparallelsequentialparallelSingle Thread Performance
Understanding Performance
Need some way of measuring performance. Coarse-grained measurements:

% gcc sample.c
% time a.out
2.312u 0.062s 0:02.50 94.8%
% gcc sample.c -O3
% time a.out
1.921u 0.093s 0:02.03 99.0%

… but did we learn much about what’s going on?

#include <stdio.h>
#include <string.h>
#define N (1 << 23)
#define T (10)
double a[N], b[N];

void cleara(double a[N]) {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = 0;
  }
}

int main() {
  double s = 0, s2 = 0;
  int i, j;
  /* record start time */
  for (j = 0; j < T; j++) {
    for (i = 0; i < N; i++) {
      b[i] = 0;
    }
    cleara(a);
    memset(a, 0, sizeof(a));
    for (i = 0; i < N; i++) {
      s  += a[i] * b[i];
      s2 += a[i] * a[i] + b[i] * b[i];
    }
  }
  /* record stop time */
  printf("s %f s2 %f\n", s, s2);
}
Increasingly possible to get accurate measurementsusing performance countersSpecial registers in the hardware to measure eventsInsert code to start, read, and stop counterMeasure exactly what you want, anywhere you wantCan measure communication and computation durationBut requires manual changesMonitoring nested scopes is an issueHeisenberg effect: counters can perturb execution timetimestopclear/startMeasurements Using Counters
Event-based profilingInterrupt execution when an event counter reaches athresholdTime-based profilingInterrupt execution every t secondsWorks without modifying your codeDoes not require that you know where problem might beSupports multiple languages and programming modelsQuite efficient for appropriate sampling frequenciesDynamic Profiling
Cycles (clock ticks)Pipeline stallsCache hitsCache missesNumber of instructionsNumber of loadsNumber of storesNumber of floating point operations…Counter Examples
Processor utilizationCycles / Wall Clock TimeInstructions per cycleInstructions / CyclesInstructions per memory operationInstructions / Loads + StoresAverage number of instructions per load missInstructions / L1 Load MissesMemory trafficLoads + Stores * Lk Cache Line SizeBandwidth consumedLoads + Stores * Lk Cache Line Size / Wall Clock TimeMany othersCache miss rateBranch misprediction rate…Useful Derived Measurements
applicationsourcerun(profilesexecution)performanceprofilebinaryobject codecompilerbinary analysisinterpret profilesourcecorrelationCommon Profiling Workflow
GNU gprofWidely available with UNIX/Linux distributionsgcc –O2 –pg foo.c –o foo./foogprof fooHPC Toolkithttp://www.hipersoft.rice.edu/hpctoolkit/PAPIhttp://icl.cs.utk.edu/papi/VTunehttp://www.intel.com/cd/software/products/asmo-na/eng/vtune/Many othersPopular Runtime Profiling Tools
Performance in Uniprocessors

time = compute + wait

Instruction-level parallelism: multiple functional units, deeply pipelined, speculation, ...
Data-level parallelism: SIMD (Single Instruction, Multiple Data), short vector instructions (multimedia extensions)
• Hardware is simpler, no heavily ported register files
• Instructions are more compact
• Reduces instruction fetch bandwidth
Complex memory hierarchies: multiple levels of caches, many outstanding misses, prefetching, ...
Single Instruction, Multiple Data
SIMD registers hold short vectors; an instruction operates on all elements in a SIMD register at once.

Scalar code (one scalar register per operand):
for (int i = 0; i < n; i += 1) {
  c[i] = a[i] + b[i];
}

Vector code (one SIMD register per operand):
for (int i = 0; i < n; i += 4) {
  c[i:i+3] = a[i:i+3] + b[i:i+3];
}
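As an illustration (not from the original slides), here is a minimal sketch of the same loop written with x86 SSE intrinsics, assuming n is a multiple of 4 and the pointers refer to valid float buffers:

#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* c[i] = a[i] + b[i], four floats per iteration.
   Assumes n is a multiple of 4; a scalar remainder loop would be
   needed otherwise. */
void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 unaligned floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
        _mm_storeu_ps(&c[i], vc);
    }
}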
SIMD in Major Instruction Set Architectures (ISAs)
For example, the Cell SPU has 128 128-bit registers; all instructions are SIMD instructions; registers are treated as short vectors of 8/16/32-bit integers or single/double-precision floats.

Instruction Set   Architecture   SIMD Width   Floating Point
AltiVec           PowerPC        128          yes
MMX/SSE           Intel          64/128       yes
3DNow!            AMD            64           yes
VIS               Sun            64           no
MAX2              HP             64           no
MVI               Alpha          64           no
MDMX              MIPS V         64           yes
Library calls and inline assemblyDifficult to programNot portableDifferent extensions to the same ISAMMX and SSESSE vs. 3DNow!Compiler vs. Crypto Oracle T4Using SIMD Instructions
Tune the parallelism firstThen tune performance on individual processorsModern processors are complexNeed instruction level parallelism for performanceUnderstanding performance requires a lot of probingOptimize for the memory hierarchyMemory is much slower than processorsMulti-layer memory hierarchies try to hide the speed gapData locality is essential for performanceProgramming for Performance
May have to change everything!Algorithms, data structures, program structureFocus on the biggest performance impedimentsToo many issues to study everythingRemember the law of diminishing returnsProgramming for Performance
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 12216 February 2012
ParallelCompilers
Parallel ExecutionParallelizing CompilersDependence AnalysisIncreasing Parallelization OpportunitiesGeneration of Parallel LoopsCommunication Code GenerationCompilers OutlineRaul Goycoolea S.Multiprocessor Programming 12416 February 2012
Instruction Level Parallelism(ILP)Task Level Parallelism (TLP)Loop Level Parallelism (LLP)or Data ParallelismPipeline ParallelismDivide and ConquerParallelismScheduling and HardwareMainly by handHand or Compiler GeneratedHardware or StreamingRecursive functionsTypes of ParallelismRaul Goycoolea S.Multiprocessor Programming 12516 February 2012
90% of the execution time in 10% of the codeMostly in loopsIf parallel, can get good performanceLoad balancingRelatively easy to analyzeWhy Loops?Raul Goycoolea S.Multiprocessor Programming 12616 February 2012
Programmer-Defined Parallel Loops
FORALL: no “loop-carried dependences”; fully parallel
FORACROSS: some “loop-carried dependences”
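As an illustration (not part of the original deck), a FORALL-style loop maps directly onto an OpenMP worksharing construct; this is a minimal sketch assuming an OpenMP-capable C compiler:

/* Every iteration writes a distinct A[i] and reads nothing written by
   another iteration, so there are no loop-carried dependences: the loop
   is a FORALL and its iterations can be split across threads. */
void forall_example(double *A, const double *B, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        A[i] = B[i] + 1.0;
    }
}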
Parallel ExecutionParallelizing CompilersDependence AnalysisIncreasing Parallelization OpportunitiesGeneration of Parallel LoopsCommunication Code GenerationOutlineRaul Goycoolea S.Multiprocessor Programming 12816 February 2012
Parallelizing Compilers
Finding FORALL loops out of FOR loops. Examples:

FOR I = 0 to 5
  A[I+1] = A[I] + 1

FOR I = 0 to 5
  A[I] = A[I+6] + 1

FOR I = 0 to 5
  A[2*I] = A[2*I + 1] + 1
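A brief worked analysis of the three loops above (my annotation, not from the slides):

/* FOR I = 0 to 5: A[I+1] = A[I] + 1
   Iteration I writes A[I+1], which iteration I+1 reads as A[I]:
   a loop-carried true dependence, so this is NOT a FORALL.

   FOR I = 0 to 5: A[I] = A[I+6] + 1
   Writes touch A[0..5], reads touch A[6..11]; the index sets never
   overlap within the iteration range, so there is no loop-carried
   dependence and the loop is a FORALL.

   FOR I = 0 to 5: A[2*I] = A[2*I+1] + 1
   Writes touch even indices, reads touch odd indices; again no
   overlap, so the loop is a FORALL. */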
Dependences
True dependence:    a = ...   then   ... = a
Anti dependence:    ... = a   then   a = ...
Output dependence:  a = ...   then   a = ...
Definition: a data dependence exists between dynamic instances i and j iff:
• either i or j is a write operation
• i and j refer to the same variable
• i executes before j
How about array accesses within loops?
Parallel ExecutionParallelizing CompilersDependence AnalysisIncreasing Parallelization OpportunitiesGeneration of Parallel LoopsCommunication Code GenerationOutlineRaul Goycoolea S.Multiprocessor Programming 13116 February 2012
Array Access in a Loop

FOR I = 0 to 5
  A[I] = A[I] + 1

[Diagram: iteration space 0..5 mapped onto data space A[0]..A[12]; each iteration I reads and writes only A[I].]
Find data dependences in loopFor every pair of array acceses to the same arrayIf the first access has at least one dynamic instance (an iteration) inwhich it refers to a location in the array that the second access alsorefers to in at least one of the later dynamic instances (iterations).Then there is a data dependence between the statements(Note that same array can refer to itself – output dependences)DefinitionLoop-carried dependence:dependence that crosses a loop boundaryIf there are no loop carried dependences are parallelizableRecognizing FORALL LoopsRaul Goycoolea S.Multiprocessor Programming 13316 February 2012
What is the Dependence?

FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1

FOR I = 1 to n
  FOR J = 1 to n
    A[I] = A[I-1] + 1

[Diagrams: the dependences plotted in the (I, J) iteration space.]
Parallel ExecutionParallelizing CompilersDependence AnalysisIncreasing Parallelization OpportunitiesGeneration of Parallel LoopsCommunication Code GenerationOutlineRaul Goycoolea S.Multiprocessor Programming 13516 February 2012
Scalar PrivatizationReduction RecognitionInduction Variable IdentificationArray PrivatizationInterprocedural ParallelizationLoop TransformationsGranularity of ParallelismIncreasing ParallelizationOpportunitiesRaul Goycoolea S.Multiprocessor Programming 13616 February 2012
Scalar Privatization
Example:
FOR i = 1 to n
  X = A[i] * 3;
  B[i] = X;
Is there a loop-carried dependence? What is the type of dependence?
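Every iteration writes X before reading it, so the only loop-carried dependences on X are anti and output (name) dependences; giving each thread its own copy removes them. A minimal sketch of the privatized loop in OpenMP (my illustration, not from the slides):

void privatize_example(const double *A, double *B, int n) {
    double X;
    /* private(X) gives each thread its own X, eliminating the
       loop-carried anti/output dependences on the shared scalar. */
    #pragma omp parallel for private(X)
    for (int i = 0; i < n; i++) {
        X = A[i] * 3.0;
        B[i] = X;
    }
}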
Reduction Recognition
Reduction analysis: only associative operations; the result is never used within the loop.
Transformation:

integer Xtmp[NUMPROC];
Barrier();
FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)
  Xtmp[myPid] = Xtmp[myPid] + A[i];
Barrier();
if (myPid == 0) {
  FOR p = 0 to NUMPROC-1
    X = X + Xtmp[p];
  ...
}
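The same transformation is what an OpenMP reduction clause performs automatically: each thread accumulates into a private copy and the partial sums are combined when the loop finishes. A minimal sketch (my illustration, not from the slides):

double sum_reduction(const double *A, int n) {
    double X = 0.0;
    /* Each thread keeps a private partial sum; the runtime adds the
       partial sums into X at the end of the parallel loop. */
    #pragma omp parallel for reduction(+:X)
    for (int i = 0; i < n; i++) {
        X += A[i];
    }
    return X;
}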
Induction Variables
Example:
FOR i = 0 to N
  A[i] = 2^i;

After strength reduction:
t = 1
FOR i = 0 to N
  A[i] = t;
  t = t*2;

What happened to the loop-carried dependences? Need to do the opposite of this!
Perform induction variable analysis; rewrite IVs as a function of the loop variable.
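A minimal sketch of the rewrite (my illustration, assuming the powers of two fit in the element type): expressing the induction variable t as a pure function of i removes the loop-carried dependence and makes the loop a FORALL again.

void rewrite_induction(long *A, int N) {
    /* Strength-reduced form: t depends on the previous iteration.
       Rewritten form: each iteration computes its own value from i,
       so iterations are independent and can run in parallel. */
    #pragma omp parallel for
    for (int i = 0; i <= N; i++) {
        A[i] = 1L << i;          /* t = 2^i expressed as a function of i */
    }
}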
Similar to scalar privatizationHowever, analysis is more complexArray Data Dependence Analysis:Checks if two iterations access the same locationArray Data Flow Analysis:Checks if two iterations access the same valueTransformationsSimilar to scalar privatizationPrivate copy for each processor or expand with an additionaldimensionArray PrivatizationRaul Goycoolea S.Multiprocessor Programming 14016 February 2012
Function calls will make a loop unparallelizatbleReduction of available parallelismA lot of inner-loop parallelismSolutionsInterprocedural AnalysisInliningInterprocedural ParallelizationRaul Goycoolea S.Multiprocessor Programming 14116 February 2012
Cache Coherent Shared Memory MachineGenerate code for the parallel loop nestNo Cache Coherent Shared Memoryor Distributed Memory MachinesGenerate code for the parallel loop nestIdentify communicationGenerate communication codeCommunication Code GenerationRaul Goycoolea S.Multiprocessor Programming 14216 February 2012
Eliminating redundant communicationCommunication aggregationMulti-cast identificationLocal memory managementCommunication OptimizationsRaul Goycoolea S.Multiprocessor Programming 14316 February 2012
Automatic parallelization of loops with arraysRequires Data Dependence AnalysisIteration space & data space abstractionAn integer programming problemMany optimizations that’ll increase parallelismTransforming loop nests and communication code generationFourier-Motzkin Elimination provides a nice frameworkSummaryRaul Goycoolea S.Multiprocessor Programming 14416 February 2012
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 14516 February 2012
Future ofParallelArchitectures
"I think there is a world market formaybe five computers.“– Thomas Watson, chairman of IBM, 1949"There is no reason in the worldanyone would want a computer in theirhome. No reason.”– Ken Olsen, Chairman, DEC, 1977"640K of RAM ought to be enough foranybody.”– Bill Gates, 1981Predicting the Future is Always RiskyRaul Goycoolea S.Multiprocessor Programming 14716 February 2012
EvolutionRelatively easy to predictExtrapolate the trendsRevolutionA completely new technology or solutionHard to PredictParadigm Shifts can occur in bothFuture = Evolution + RevolutionRaul Goycoolea S.Multiprocessor Programming 14816 February 2012
EvolutionTrendsArchitectureLanguages, Compilers and ToolsRevolutionCrossing the Abstraction BoundariesOutlineRaul Goycoolea S.Multiprocessor Programming 14916 February 2012
Look at the trendsMoore‟s LawPower ConsumptionWire DelayHardware ComplexityParallelizing CompilersProgram Design MethodologiesDesign Drivers are different inDifferent GenerationsEvolutionRaul Goycoolea S.Multiprocessor Programming 15016 February 2012
The Road to Multicore: Moore’s Law [chart repeated from earlier; Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
The Road to Multicore: Uniprocessor Performance (SPECint) [chart repeated from earlier]
The Road to Multicore: Uniprocessor Performance (SPECint). General-purpose unicores have stopped historic performance scaling: power consumption, wire delays, DRAM access latency, diminishing returns of more instruction-level parallelism
Power Consumption (watts) [chart repeated from earlier]
Power Efficiency (watts/spec) [chart repeated from earlier]
Range of a Wire in One Clock Cycle [chart repeated from earlier; SIA Roadmap]
DRAM Access Latency [chart repeated from earlier]
CPU Architecture: heat becoming an unmanageable problem [chart repeated from earlier; Intel Developer Forum, Spring 2004, Pat Gelsinger]
Improvement in Automatic Parallelization
[Timeline, 1970 to 2010: automatic parallelizing compilers for FORTRAN; vectorization technology; compiling for instruction-level parallelism; prevalence of type-unsafe languages and complex data structures (C, C++); type-safe languages (Java, C#); demand driven by multicores?]
Multicores Future [chart repeated from earlier: number of cores per chip versus year, 1970 to 2010, from unicores to multicores with 2 to 512 cores]
EvolutionTrendsArchitectureLanguages, Compilers and ToolsRevolutionCrossing the Abstraction BoundariesOutlineRaul Goycoolea S.Multiprocessor Programming 16116 February 2012
Novel Opportunities in Multicores
• Don’t have to contend with uniprocessors
• The era of Moore’s Law induced performance gains is over!
• Parallel programming will be required by the masses, not just a few supercomputer super-users
Not your same old multiprocessor problem:
• How does going from multiprocessors to multicores impact programs?
• What changed? Where is the impact?
• Communication bandwidth
• Communication latency
Communication Bandwidth
• How much data can be communicated between two cores?
• What changed?
– Number of wires: IO is the true bottleneck; on-chip wire density is very high
– Clock rate: IO is slower than on-chip
– Multiplexing: no sharing of pins on chip
• Roughly 32 Gigabits/sec between chips versus ~300 Terabits/sec between cores on a chip: about 10,000X
• Impact on programming model?
– Massive data exchange is possible
– Data movement is not the bottleneck, so processor affinity is not that important
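To make the bandwidth point concrete, here is a minimal sketch, assuming POSIX threads and a pthread barrier: a producer fills a large shared array and a consumer reads it in place, so only a pointer and one synchronization cross between cores, never the data itself. The buffer size and the barrier-based handoff are illustrative choices, not something prescribed by the slides.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)                 /* ~16M doubles (~128 MB): a "massive" buffer */

static double *shared_buf;          /* single copy of the data, visible to every core */
static pthread_barrier_t handoff;   /* synchronization point between the two threads */

static void *producer(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++)
        shared_buf[i] = (double)i;          /* fill the buffer in place */
    pthread_barrier_wait(&handoff);         /* hand it over: a pointer changes hands, not the data */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    pthread_barrier_wait(&handoff);         /* wait until the producer is done */
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += shared_buf[i];               /* read the same memory: no copy was ever made */
    printf("checksum = %f\n", sum);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    shared_buf = malloc(N * sizeof(double));
    if (shared_buf == NULL) return 1;
    pthread_barrier_init(&handoff, NULL, 2);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    pthread_barrier_destroy(&handoff);
    free(shared_buf);
    return 0;
}

On a distributed-memory design the same handoff would mean shipping the whole buffer across the interconnect; here the data never leaves shared memory and only the barrier crosses between cores.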
Communication Latency
• How long does a round-trip communication take?
• What changed?
– Length of wire: very short wires are faster
– Pipeline stages: no multiplexing, on-chip is much closer; bypass and speculation?
• Roughly ~200 cycles between chips down to ~4 cycles between cores: about 50X
• Impact on programming model?
– Ultra-fast synchronization
– Can run real-time apps on multiple cores
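To see why a short round trip changes what is practical, here is a minimal sketch, assuming C11 atomics and POSIX threads: two threads bounce a token through a single atomic flag and the program reports the average round-trip time. The round count and the busy-wait style are illustrative; this kind of fine-grained handoff is only reasonable when core-to-core latency is very low.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static atomic_int turn = 0;     /* 0: ping may run, 1: pong may run */

static void *ping(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&turn, memory_order_acquire) != 0)
            ;                                       /* spin until it is our turn */
        atomic_store_explicit(&turn, 1, memory_order_release);
    }
    return NULL;
}

static void *pong(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&turn, memory_order_acquire) != 1)
            ;
        atomic_store_explicit(&turn, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, ping, NULL);
    pthread_create(&b, NULL, pong, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* one ping handoff plus one pong handoff = one round trip; ROUNDS of them in total */
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average round trip: %.1f ns\n", ns / ROUNDS);
    return 0;
}

On a multicore the measured figure is typically on the order of tens of nanoseconds; crossing chips in a traditional multiprocessor costs far more, which is why such tight synchronization used to be avoided.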
Past, Present and the Future?
• Traditional multiprocessor: each processing element (PE) with its private caches sits next to its own memory, and chips communicate over an external interconnect
• Basic multicore (e.g., IBM Power): a few PEs with their caches on one chip, sharing memory
• Integrated multicore (e.g., the 8-core, 8-thread Oracle T4): PEs, caches and the on-chip crossbar integrated on a single die in front of shared memory
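Whichever of these designs a program lands on, it usually sees a pool of hardware threads behind a shared-memory interface. A minimal sketch, assuming a POSIX system where sysconf(_SC_NPROCESSORS_ONLN) reports the online hardware contexts (which would be 64 on an 8-core, 8-thread-per-core part), is to size the worker pool at run time:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *work(void *arg) {
    long id = (long)arg;
    printf("worker %ld running\n", id);    /* real work would go here */
    return NULL;
}

int main(void) {
    /* discover how many hardware threads the OS currently exposes */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1) n = 1;

    pthread_t *tid = malloc((size_t)n * sizeof(pthread_t));
    if (tid == NULL) return 1;

    for (long i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, work, (void *)i);
    for (long i = 0; i < n; i++)
        pthread_join(tid[i], NULL);

    printf("ran %ld workers, one per hardware thread\n", n);
    free(tid);
    return 0;
}

Because the core and thread counts are discovered at run time, the same binary scales from a basic dual-core to an integrated 64-thread chip without source changes.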
Summary
• As technology evolves, the inherent flexibility of multiprocessors lets them adapt to new requirements
• Processors can be applied at any time to many kinds of applications
• Optimization adapts processors to high-performance requirements
References
• Author: Raul Goycoolea, Oracle Corporation.
• A search on the WWW for "parallel programming" or "parallel computing" will yield a wide variety of information.
• Recommended reading:
– "Designing and Building Parallel Programs". Ian Foster. http://www-unix.mcs.anl.gov/dbpp/
– "Introduction to Parallel Computing". Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/
– "Overview of Recent Supercomputers". A.J. van der Steen, Jack Dongarra. www.phys.uu.nl/~steen/web03/overview.html
• MIT Multicore Programming Class: 6.189, Prof. Saman Amarasinghe
• Photos/graphics have been created by the author, obtained from non-copyrighted, government or public domain sources (such as http://commons.wikimedia.org/), or used with the permission of authors from other presentations and web pages.
Keep in Touch — Raul Goycoolea Seoane
• Twitter: http://twitter.com/raul_goycoolea
• Facebook: http://www.facebook.com/raul.goycoolea
• Linkedin: http://www.linkedin.com/in/raulgoy
• Blog: http://blogs.oracle.com/raulgoy/
Questions?
Multiprocessor architecture and programming
Multiprocessor architecture and programming

  • 1.Parallel Computing Architecture &Programming TechniquesRaul Goycoolea S.Solution Architect ManagerOracle Enterprise Architecture Group
  • 2.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 216 February 2012
  • 4.The “Software Crisis”“To put it quite bluntly: as long as there were nomachines, programming was no problem at all; whenwe had a few weak computers, programming became amild problem, and now we have gigantic computers,programming has become an equally gigantic problem."-- E. Dijkstra, 1972 Turing Award LectureRaul Goycoolea S.Multiprocessor Programming 416 February 2012
  • 5.The First Software Crisis• Time Frame: ’60s and ’70s• Problem: Assembly Language ProgrammingComputers could handle larger more complex programs• Needed to get Abstraction and Portability withoutlosing PerformanceRaul Goycoolea S.Multiprocessor Programming 516 February 2012
  • 6.Common PropertiesSingle flow of controlSingle memory imageDifferences:Register FileISAFunctional UnitsHow Did We Solve The First SoftwareCrisis?• High-level languages for von-Neumann machinesFORTRAN and C• Provided “common machine language” foruniprocessorsRaul Goycoolea S.Multiprocessor Programming 616 February 2012
  • 7.The Second Software Crisis• Time Frame: ’80s and ’90s• Problem: Inability to build and maintain complex androbust applications requiring multi-million lines ofcode developed by hundreds of programmersComputers could handle larger more complex programs• Needed to get Composability, Malleability andMaintainabilityHigh-performance was not an issue left for Moore’s LawRaul Goycoolea S.Multiprocessor Programming 716 February 2012
  • 8.How Did We Solve the SecondSoftware Crisis?• Object Oriented ProgrammingC++, C# and Java• Also…Better tools• Component libraries, PurifyBetter software engineering methodology• Design patterns, specification, testing, codereviewsRaul Goycoolea S.Multiprocessor Programming 816 February 2012
  • 9.Today:Programmers are Oblivious to Processors• Solid boundary between Hardware and Software• Programmers don’t have to know anything about theprocessorHigh level languages abstract away the processorsEx: Java bytecode is machine independentMoore’s law does not require the programmers to know anythingabout the processors to get good speedups• Programs are oblivious of the processor works on allprocessorsA program written in ’70 using C still works and is much faster today• This abstraction provides a lot of freedom for theprogrammersRaul Goycoolea S.Multiprocessor Programming 916 February 2012
  • 10.The Origins of a Third Crisis• Time Frame: 2005 to 20??• Problem: Sequential performance is left behind byMoore’s law• Needed continuous and reasonable performanceimprovementsto support new featuresto support larger datasets• While sustaining portability, malleability andmaintainability without unduly increasing complexityfaced by the programmer critical to keep-up with thecurrent rate of evolution in softwareRaul Goycoolea S.Multiprocessor Programming 1016 February 2012
  • 11.Performance(vs.VAX-11/780)NumberofTransistors52%/year1001000100001000001978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016%/year108086128625%/year386486PentiumP2P3P4Itanium 2Itanium1,000,000,000100,00010,0001,000,00010,000,000100,000,000From Hennessy and Patterson, Computer Architecture:A Quantitative Approach, 4th edition, 2006The Road to Multicore: Moore’s LawRaul Goycoolea S.Multiprocessor Programming 1116 February 2012
  • 12.Specint200010000.001000.00100.0010.001.0085 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07intel pentiumintel pentium2intel pentium3intel pentium4intel itaniumAlpha 21064Alpha 21164Alpha 21264Spar cSuper Spar cSpar c64MipsHP PAPower PCAMD K6AMD K7AMD x86-64The Road to Multicore:Uniprocessor Performance (SPECint)Raul Goycoolea S.Multiprocessor Programming 1216 February 2012Intel 386Intel 486
  • 13.The Road to Multicore:Uniprocessor Performance (SPECint)General-purpose unicores have stopped historicperformance scalingPower consumptionWire delaysDRAM access latencyDiminishing returns of more instruction-level parallelismRaul Goycoolea S.Multiprocessor Programming 1316 February 2012
  • 14.Power100010010185 87 89 91 93 95 97 99 01 03 05 07Intel 386Intel 486intel pentiumintel pentium2intel pentium3intel pentium4intel itaniumAlpha21064Alpha21164Alpha21264SparcSuperSparcSparc64MipsHPPAPower PCAMDK6AMDK7AMDx86-64Power Consumption (watts)Raul Goycoolea S.Multiprocessor Programming 1416 February 2012
  • 15.Watts/Spec0.70.60.50.40.30.20.11982 1984 1987 1990 1993 1995 1998 2001 2004 2006Yearintel 386intel 486intel pentiumintel pentium 2intel pentium 3intel pentium 4intel itaniumAlpha 21064Alpha 21164Alpha 21264SparcSuperSparcSparc64MipsHP PAPower PCAMD K6AMD K7AMD x86-640Power Efficiency (watts/spec)Raul Goycoolea S.Multiprocessor Programming 1516 February 2012
  • 16.Process(microns)0.060.040.0200.260.240.220.20.180.160.140.120.10.081996 1998 2000 2002 2008 2010 2012 20142004 2006Year700 MHz1.25 GHz2.1 GHz6 GHz10 GHz13.5 GHz• 400 mm2 Die• From the SIA RoadmapRange of a Wire in One Clock CycleRaul Goycoolea S.Multiprocessor Programming 1616 February 2012
  • 17.Performance19841994199219821988198619801996199820002002199020041000000100001001YearµProc60%/yr.(2X/1.5yr)DRAM9%/yr.(2X/10 yrs)DRAM Access Latency• Access times are aspeed of light issue• Memory technology isalso changingSRAM are getting harder toscaleDRAM is no longer cheapestcost/bit• Power efficiency is anissue here as wellRaul Goycoolea S.Multiprocessor Programming 1716 February 2012
  • 18.PowerDensity(W/cm2)10,0001,000„70 „80 „90 „00 „1010 400480088080180868085286 386486Pentium®Hot PlateNuclear Reactor100Sun‟s SurfaceRocket NozzleIntel Developer Forum, Spring 2004 - Pat Gelsinger(Pentium at 90 W)Cube relationship between the cycle time and powerCPUs ArchitectureHeat becoming an unmanageable problemRaul Goycoolea S.Multiprocessor Programming 1816 February 2012
  • 19.Diminishing Returns• The ’80s: Superscalar expansion50% per year improvement in performanceTransistors applied to implicit parallelism- pipeline processor (10 CPI --> 1 CPI)• The ’90s: The Era of Diminishing ReturnsSqueaking out the last implicit parallelism2-way to 6-way issue, out-of-order issue, branch prediction1 CPI --> 0.5 CPIPerformance below expectations projects delayed & canceled• The ’00s: The Beginning of the Multicore EraThe need for Explicit ParallelismRaul Goycoolea S.Multiprocessor Programming 1916 February 2012
  • 20.Mit Raw16 Cores2002 Intel TanglewoodDual Core IA/64Intel DempseyDual Core XeonIntel Montecito1.7 Billion transistorsDual Core IA/64Intel Pentium D(Smithfield)CancelledIntel Tejas & JayhawkUnicore (4GHz P4)IBM Power 6Dual CoreIBM Power 4 and 5Dual Cores Since 2001Intel Pentium Extreme3.2GHz Dual CoreIntel YonahDual Core MobileAMD OpteronDual CoreSun Olympus and Niagara8 Processor CoresIBM CellScalable Multicore… 1H 2005 1H 2006 2H 20062H 20052H 2004Unicores are on extinctionNow all is multicore
  • 21.# of1985 199019801970 1975 1995 2000 2005RawCaviumOcteonRazaXLRCSR-1IntelTflopsPicochipPC102CiscoNiagaraBoardcom 1480Xbox3602010218432cores 1612864512256CellOpteron 4PXeon MPAmbricAM20454004800880868080 286 386 486 PentiumPA-8800 Opteron TanglewoodPower4PExtreme Power6YonahP2 P3 ItaniumP4Athlon Itanium 2Multicores FutureRaul Goycoolea S.Multiprocessor Programming 2116 February 2012
  • 22.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 2216 February 2012
  • 24.Traditionally, software has been written for serial computation:• To be run on a single computer having a single Central Processing Unit (CPU)• A problem is broken into a discrete series of instructions• Instructions are executed one after another• Only one instruction may execute at any moment in timeWhat is Parallel Computing?Raul Goycoolea S.Multiprocessor Programming 2416 February 2012
  • 25.What is Parallel Computing?In the simplest sense, parallel computing is the simultaneous use of multiplecompute resources to solve a computational problem:• To be run using multiple CPUs• A problem is broken into discrete parts that can be solved concurrently• Each part is further broken down to a series of instructions• Instructions from each part execute simultaneously on different CPUsRaul Goycoolea S.Multiprocessor Programming 2516 February 2012
  • 26.Options in Parallel Computing?The compute resources might be:• A single computer with multiple processors;• An arbitrary number of computers connected by a network;• A combination of both.The computational problem should be able to:• Be broken apart into discrete pieces of work that can be solvedsimultaneously;• Execute multiple program instructions at any moment in time;• Be solved in less time with multiple compute resources than with asingle compute resource.Raul Goycoolea S.Multiprocessor Programming 2616 February 2012
  • 27.27
  • 28.The Real World is Massively Parallel• Parallel computing is an evolution of serial computing thatattempts to emulate what has always been the state ofaffairs in the natural world: many complex, interrelatedevents happening at the same time, yet within a sequence.For example:• Galaxy formation• Planetary movement• Weather and ocean patterns• Tectonic plate drift Rush hour traffic• Automobile assembly line• Building a jet• Ordering a hamburger at the drive through.Raul Goycoolea S.Multiprocessor Programming 2816 February 2012
  • 29.Architecture ConceptsVon Neumann Architecture• Named after the Hungarian mathematician John von Neumann who first authoredthe general requirements for an electronic computer in his 1945 papers• Since then, virtually all computers have followed this basic design, differing fromearlier computers which were programmed through "hard wiring”• Comprised of four main components:• Memory• Control Unit• Arithmetic Logic Unit• Input/Output• Read/write, random access memory is used to storeboth program instructions and data• Program instructions are coded data which tellthe computer to do something• Data is simply information to be used by theprogram• Control unit fetches instructions/data from memory, decodesthe instructions and then sequentially coordinates operationsto accomplish the programmed task.• Aritmetic Unit performs basic arithmetic operations• Input/Output is the interface to the human operatorRaul Goycoolea S.Multiprocessor Programming 2916 February 2012
  • 30.Flynn’s Taxonomy• There are different ways to classify parallel computers. One of the morewidely used classifications, in use since 1966, is called Flynn'sTaxonomy.• Flynn's taxonomy distinguishes multi-processor computer architecturesaccording to how they can be classified along the two independentdimensions of Instruction and Data. Each of these dimensions canhave only one of two possible states: Single or Multiple.• The matrix below defines the 4 possible classifications according toFlynn:Raul Goycoolea S.Multiprocessor Programming 3016 February 2012
  • 31.Single Instruction, Single Data (SISD):• A serial (non-parallel) computer• Single Instruction: Only one instruction stream isbeing acted on by the CPU during any one clockcycle• Single Data: Only one data stream is being usedas input during any one clock cycle• Deterministic execution• This is the oldest and even today, the mostcommon type of computer• Examples: older generation mainframes,minicomputers and workstations; most modernday PCs.Raul Goycoolea S.Multiprocessor Programming 3116 February 2012
  • 32.Single Instruction, Single Data (SISD):Raul Goycoolea S.Multiprocessor Programming 3216 February 2012
  • 33.Single Instruction, Multiple Data(SIMD):• A type of parallel computer• Single Instruction: All processing units execute the same instruction at anygiven clock cycle• Multiple Data: Each processing unit can operate on a different data element• Best suited for specialized problems characterized by a high degree ofregularity, such as graphics/image processing.• Synchronous (lockstep) and deterministic execution• Two varieties: Processor Arrays and Vector Pipelines• Examples:• Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,ETA10• Most modern computers, particularly those with graphics processor units(GPUs) employ SIMD instructions and execution units.Raul Goycoolea S.Multiprocessor Programming 3316 February 2012
  • 34.Single Instruction, Multiple Data(SIMD):ILLIAC IV MasPar TM CM-2 Cell GPUCray X-MP Cray Y-MPRaul Goycoolea S.Multiprocessor Programming 3416 February 2012
  • 35.• A type of parallel computer• Multiple Instruction: Each processing unit operates on the dataindependently via separate instruction streams.• Single Data: A single data stream is fed into multiple processingunits.• Few actual examples of this class of parallel computer have everexisted. One is the experimental Carnegie-Mellon C.mmp computer(1971).• Some conceivable uses might be:• multiple frequency filters operating on a single signal stream• multiple cryptography algorithms attempting to crack a single codedmessage.Multiple Instruction, Single Data(MISD):Raul Goycoolea S.Multiprocessor Programming 3516 February 2012
  • 36.Multiple Instruction, Single Data(MISD):Raul Goycoolea S.Multiprocessor Programming 3616 February 2012
  • 37.• A type of parallel computer• Multiple Instruction: Every processor may be executing a differentinstruction stream• Multiple Data: Every processor may be working with a differentdata stream• Execution can be synchronous or asynchronous, deterministic ornon-deterministic• Currently, the most common type of parallel computer - mostmodern supercomputers fall into this category.• Examples: most current supercomputers, networked parallelcomputer clusters and "grids", multi-processor SMP computers,multi-core PCs.Note: many MIMD architectures also include SIMD execution sub-componentsMultiple Instruction, Multiple Data(MIMD):Raul Goycoolea S.Multiprocessor Programming 3716 February 2012
  • 38.Multiple Instruction, Multiple Data(MIMD):Raul Goycoolea S.Multiprocessor Programming 3816 February 2012
  • 39.Multiple Instruction, Multiple Data(MIMD):IBM Power HP Alphaserver Intel IA32/x64Oracle SPARC Cray XT3 Oracle Exadata/ExalogicRaul Goycoolea S.Multiprocessor Programming 3916 February 2012
  • 40.Parallel Computer Memory ArchitectureShared MemoryShared memory parallel computers vary widely, but generally have in common theability for all processors to access all memory as global address space.Multiple processors can operate independently but share the same memoryresources.Changes in a memory location effected by one processor are visible to all otherprocessors.Shared memory machines can be divided into two main classes based uponmemory access times: UMA and NUMA.Uniform Memory Access (UMA):• Most commonly represented today by Symmetric Multiprocessor (SMP) machines• Identical processorsNon-Uniform Memory Access (NUMA):• Often made by physically linking two or more SMPs• One SMP can directly access memory of another SMP40Raul Goycoolea S.Multiprocessor Programming 4016 February 2012
  • 41.Parallel Computer Memory ArchitectureShared Memory41Shared Memory (UMA) Shared Memory (NUMA)Raul Goycoolea S.Multiprocessor Programming 4116 February 2012
  • 42.Basic structure of a centralizedshared-memory multiprocessorProcessor Processor Processor ProcessorOne or morelevels of CacheOne or morelevels of CacheOne or morelevels of CacheOne or morelevels of CacheMultiple processor-cache subsystems share the same physical memory, typically connected by a bus.In larger designs, multiple buses, or even a switch may be used, but the key architectural property: uniformaccess time o all memory from all processors remains.Raul Goycoolea S.Multiprocessor Programming 4216 February 2012
  • 43.Processor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryInterconnection NetworkBasic Architecture of a DistributedMultiprocessorConsists of individual nodes containing a processor, some memory, typically some I/O, and an interface to aninterconnection network that connects all the nodes. Individual nodes may contain a small number ofprocessors, which may be interconnected by a small bus or a different interconnection technology, which is lessscalable than the global interconnection network.Raul Goycoolea S.Multiprocessor Programming 4316 February 2012
  • 44.Communicationhow do parallel operations communicate data results?Synchronizationhow are parallel operations coordinated?Resource Managementhow are a large number of parallel tasks scheduled ontofinite hardware?Scalabilityhow large a machine can be built?Issues in Parallel Machine DesignRaul Goycoolea S.Multiprocessor Programming 4416 February 2012
  • 45.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 4516 February 2012
  • 47.ExplicitImplicitHardware CompilerSuperscalarProcessorsExplicitly Parallel ArchitecturesImplicit vs. Explicit ParallelismRaul Goycoolea S.Multiprocessor Programming 4716 February 2012
  • 48.Implicit Parallelism: Superscalar ProcessorsExplicit ParallelismShared Instruction ProcessorsShared Sequencer ProcessorsShared Network ProcessorsShared Memory ProcessorsMulticore ProcessorsOutlineRaul Goycoolea S.Multiprocessor Programming 4816 February 2012
  • 49.Issue varying numbers of instructions per clockstatically scheduled––using compiler techniquesin-order executiondynamically scheduled–––––Extracting ILP by examining 100‟s of instructionsScheduling them in parallel as operands become availableRename registers to eliminate anti dependencesout-of-order executionSpeculative executionImplicit Parallelism: SuperscalarProcessorsRaul Goycoolea S.Multiprocessor Programming 4916 February 2012
  • 50.Instruction i IF ID EX WBIF ID EX WBIF ID EX WBIF ID EX WBIF ID EX WBInstruction i+1Instruction i+2Instruction i+3Instruction i+4Instruction # 1 2 3 4 5 6 7 8IF: Instruction fetchEX : ExecutionCyclesID : Instruction decodeWB : Write backPipelining ExecutionRaul Goycoolea S.Multiprocessor Programming 5016 February 2012
  • 51.Instruction type 1 2 3 4 5 6 7CyclesIntegerFloating pointIFIFIDIDEXEXWBWBIntegerFloating pointIntegerFloating pointIntegerFloating pointIFIFIDIDEXEXWBWBIFIFIDIDEXEXWBWBIFIFIDIDEXEXWBWB2-issue super-scalar machineSuper-Scalar ExecutionRaul Goycoolea S.Multiprocessor Programming 5116 February 2012
  • 52.Intrinsic data dependent (aka true dependence) on Instructions:I: add r1,r2,r3J: sub r4,r1,r3If two instructions are data dependent, they cannot executesimultaneously, be completely overlapped or execute in out-of-orderIf data dependence caused a hazard in pipeline,called a Read After Write (RAW) hazardData Dependence and HazardsRaul Goycoolea S.Multiprocessor Programming 5216 February 2012
  • 53.HW/SW must preserve program order:order instructions would execute in if executed sequentially asdetermined by original source programDependences are a property of programsImportance of the data dependencies1) indicates the possibility of a hazard2) determines order in which results must be calculated3) sets an upper bound on how much parallelism can possiblybe exploitedGoal: exploit parallelism by preserving program order onlywhere it affects the outcome of the programILP and Data Dependencies, HazardsRaul Goycoolea S.Multiprocessor Programming 5316 February 2012
  • 54.Name dependence: when 2 instructions use same register ormemory location, called a name, but no flow of data betweenthe instructions associated with that name; 2 versions ofname dependenceInstrJ writes operand before InstrIreads itI: sub r4,r1,r3J: add r1,r2,r3K: mul r6,r1,r7Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”If anti-dependence caused a hazard in the pipeline, called aWrite After Read (WAR) hazardName Dependence #1: Anti-dependeceRaul Goycoolea S.Multiprocessor Programming 5416 February 2012
  • 55.Instruction writes operand before InstrIwrites it.I: sub r1,r4,r3J: add r1,r2,r3K: mul r6,r1,r7Called an “output dependence” by compiler writers.This also results from the reuse of name “r1”If anti-dependence caused a hazard in the pipeline, called aWrite After Write (WAW) hazardInstructions involved in a name dependence can executesimultaneously if name used in instructions is changed soinstructions do not conflictRegister renaming resolves name dependence for registersRenaming can be done either by compiler or by HWName Dependence #1: OutputDependenceRaul Goycoolea S.Multiprocessor Programming 5516 February 2012
  • 56.Every instruction is control dependent on some set ofbranches, and, in general, these control dependencies mustbe preserved to preserve program orderif p1 {S1;};if p2 {S2;}S1 is control dependent on p1, and S2 is control dependenton p2 but not on p1.Control dependence need not be preservedwilling to execute instructions that should not have beenexecuted, thereby violating the control dependences, if cando so without affecting correctness of the programSpeculative ExecutionControl DependenciesRaul Goycoolea S.Multiprocessor Programming 5616 February 2012
  • 57.Greater ILP: Overcome control dependence by hardwarespeculating on outcome of branches and executingprogram as if guesses were correctSpeculation ⇒ fetch, issue, and executeinstructions as if branch predictions were alwayscorrectDynamic scheduling ⇒ only fetches and issuesinstructionsEssentially a data flow execution model: Operationsexecute as soon as their operands are availableSpeculationRaul Goycoolea S.Multiprocessor Programming 5716 February 2012
  • 58.Different predictorsBranch PredictionValue PredictionPrefetching (memory access pattern prediction)InefficientPredictions can go wrongHas to flush out wrongly predicted dataWhile not impacting performance, it consumes powerSpeculation in Rampant in ModernSuperscalarsRaul Goycoolea S.Multiprocessor Programming 5816 February 2012
  • 59.Implicit Parallelism: Superscalar ProcessorsExplicit ParallelismShared Instruction ProcessorsShared Sequencer ProcessorsShared Network ProcessorsShared Memory ProcessorsMulticore ProcessorsOutlineRaul Goycoolea S.Multiprocessor Programming 5916 February 2012
  • 60.Parallelism is exposed to softwareCompiler or ProgrammerMany different formsLoosely coupled Multiprocessors to tightly coupled VLIWExplicit Parallel ProcessorsRaul Goycoolea S.Multiprocessor Programming 6016 February 2012
  • 61.Throughput per CycleOne OperationLatency in CyclesParallelism = Throughput * LatencyTo maintain throughput T/cycle when each operation haslatency L cycles, need T*L independent operationsFor fixed parallelism:decreased latency allows increased throughputdecreased throughput allows increased latency toleranceLittle’s LawRaul Goycoolea S.Multiprocessor Programming 6116 February 2012
  • 62.TimeTimeTimeTimeData-Level Parallelism (DLP)Instruction-Level Parallelism (ILP)PipeliningThread-Level Parallelism (TLP)Types of Software ParallelismRaul Goycoolea S.Multiprocessor Programming 6216 February 2012
  • 63.PipeliningThreadParallelDataParallelInstructionParallelTranslating Parallelism TypesRaul Goycoolea S.Multiprocessor Programming 6316 February 2012
  • 64.What is a sequential program?A single thread of control that executes one instruction and when it isfinished execute the next logical instructionWhat is a concurrent program?A collection of autonomous sequential threads, executing (logically) inparallelThe implementation (i.e. execution) of a collection of threads can be:Multiprogramming– Threads multiplex their executions on a single processor.Multiprocessing– Threads multiplex their executions on a multiprocessor or a multicore systemDistributed Processing– Processes multiplex their executions on several different machinesWhat is concurrency?Raul Goycoolea S.Multiprocessor Programming 6416 February 2012
  • 65.Concurrency is not (only) parallelismInterleaved ConcurrencyLogically simultaneous processingInterleaved execution on a singleprocessorParallelismPhysically simultaneous processingRequires a multiprocessors or amulticore systemTimeTimeABCABCConcurrency and ParallelismRaul Goycoolea S.Multiprocessor Programming 6516 February 2012
  • 66.There are a lot of ways to use Concurrency inProgrammingSemaphoresBlocking & non-blocking queuesConcurrent hash mapsCopy-on-write arraysExchangersBarriersFuturesThread pool supportOther Types of SynchronizationRaul Goycoolea S.Multiprocessor Programming 6616 February 2012
  • 67.DeadlockTwo or more threads stop and wait for each otherLivelockTwo or more threads continue to execute, but makeno progress toward the ultimate goalStarvationSome thread gets deferred foreverLack of fairnessEach thread gets a turn to make progressRace ConditionSome possible interleaving of threads results in anundesired computation resultPotential Concurrency ProblemsRaul Goycoolea S.Multiprocessor Programming 6716 February 2012
  • 68.Concurrency and Parallelism are important conceptsin Computer ScienceConcurrency can simplify programmingHowever it can be very hard to understand and debugconcurrent programsParallelism is critical for high performanceFrom Supercomputers in national labs toMulticores and GPUs on your desktopConcurrency is the basis for writing parallel programsNext Lecture: How to write a Parallel ProgramParallelism ConclusionsRaul Goycoolea S.Multiprocessor Programming 6816 February 2012
  • 69.Shared memory––––Ex: Intel Core 2 Duo/QuadOne copy of data sharedamong many coresAtomicity, locking andsynchronizationessential for correctnessMany scalability issuesDistributed memory––––Ex: CellCores primarily access localmemoryExplicit data exchangebetween coresData distribution andcommunication orchestrationis essential for performanceP1 P2 P3 PnMemoryInterconnection NetworkInterconnection NetworkP1 P2 P3 PnM1 M2 M3 MnTwo primary patterns of multicore architecture designArchitecture RecapRaul Goycoolea S.Multiprocessor Programming 6916 February 2012
  • 70.Processor 1…n ask for XThere is only one place to lookCommunication throughshared variablesRace conditions possibleUse synchronization to protect from conflictsChange how data is stored to minimize synchronizationP1 P2 P3 PnMemoryxInterconnection NetworkProgramming Shared Memory ProcessorsRaul Goycoolea S.Multiprocessor Programming 7016 February 2012
  • 71.Data parallelPerform same computationbut operate on different dataA single process can forkmultiple concurrent threadsEach thread encapsulate its own execution pathEach thread has local state and shared resourcesThreads communicate through shared resourcessuch as global memoryfor (i = 0; i < 12; i++)C[i] = A[i] + B[i];i=0i=1i=2i=3i=8i=9i = 10i = 11i=4i=5i=6i=7join (barrier)fork (threads)Example of ParallelizationRaul Goycoolea S.Multiprocessor Programming 7116 February 2012
  • 72.int A[12] = {...}; int B[12] = {...}; int C[12];void add_arrays(int start){int i;for (i = start; i < start + 4; i++)C[i] = A[i] + B[i];}int main (int argc, char *argv[]){pthread_t threads_ids[3];int rc, t;for(t = 0; t < 4; t++) {rc = pthread_create(&thread_ids[t],NULL /* attributes */,add_arrays /* function */,t * 4 /* args to function */);}pthread_exit(NULL);}join (barrier)i=0i=1i=2i=3i=4i=5i=6i=7i=8i=9i = 10i = 11fork (threads)Example Parallelization with ThreadsRaul Goycoolea S.Multiprocessor Programming 7216 February 2012
  • 73.Data parallelismPerform same computationbut operate on different dataControl parallelismPerform different functionsfork (threads)join (barrier)pthread_create(/* thread id */,/* attributes */,/*/*any functionargs to function*/,*/);Types of ParallelismRaul Goycoolea S.Multiprocessor Programming 7316 February 2012
  • 74.Uniform Memory Access (UMA)Centrally located memoryAll processors are equidistant (access times)Non-Uniform Access (NUMA)Physically partitioned but accessible by allProcessors have the same address spacePlacement of data affects performanceMemory Access Latency in SharedMemory ArchitecturesRaul Goycoolea S.Multiprocessor Programming 7416 February 2012
  • 75.Coverage or extent of parallelism in algorithmGranularity of data partitioning among processorsLocality of computation and communication… so how do I parallelize my program?Summary of Parallel PerformanceFactorsRaul Goycoolea S.Multiprocessor Programming 7516 February 2012
  • 76.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 7616 February 2012
  • 78.P0Tasks Processes ProcessorsP1P2 P3p0 p1p2 p3p0 p1p2 p3PartitioningSequentialcomputationParallelprogramdecompositionassignmentorchestrationmappingCommon Steps to Create a ParallelProgram
  • 79.Identify concurrency and decide at what level toexploit itBreak up computation into tasks to be dividedamong processesTasks may become available dynamicallyNumber of tasks may vary with timeEnough tasks to keep processors busyNumber of tasks available at a time is upper bound onachievable speedupDecomposition (Amdahl’s Law)
  • 80.Specify mechanism to divide work among coreBalance work and reduce communicationStructured approaches usually work wellCode inspection or understanding of applicationWell-known design patternsAs programmers, we worry about partitioning firstIndependent of architecture or programming modelBut complexity often affect decisions!Granularity
  • 81.Computation and communication concurrencyPreserve locality of dataSchedule tasks to satisfy dependences earlyOrchestration and Mapping
  • 82.Provides a cookbook to systematically guide programmersDecompose, Assign, Orchestrate, MapCan lead to high quality solutions in some domainsProvide common vocabulary to the programming communityEach pattern has a name, providing a vocabulary fordiscussing solutionsHelps with software reusability, malleability, and modularityWritten in prescribed format to allow the reader toquickly understand the solution and its contextOtherwise, too difficult for programmers, and software will notfully exploit parallel hardwareParallel Programming by Pattern
  • 83.Berkeley architecture professorChristopher AlexanderIn 1977, patterns for cityplanning, landscaping, andarchitecture in an attempt tocapture principles for “living”designHistory
  • 85.Design Patterns: Elements of Reusable Object-Oriented Software (1995)Gang of Four (GOF): Gamma, Helm, Johnson, VlissidesCatalogue of patternsCreation, structural, behavioralPatterns in Object-OrientedProgramming
  • 86.Algorithm ExpressionFinding ConcurrencyExpose concurrent tasksAlgorithm StructureMap tasks to processes toexploit parallel architecture4 Design SpacesSoftware ConstructionSupporting StructuresCode and data structuringpatternsImplementation MechanismsLow level mechanisms usedto write parallel programsPatterns for ParallelProgramming. Mattson,Sanders, and Massingill(2005).Patterns for Parallelizing Programs
  • 87.splitfrequency encodedmacroblocksZigZagIQuantizationIDCTSaturationspatially encoded macroblocksdifferentially codedmotion vectorsMotion Vector DecodeRepeatmotion vectorsMPEG bit streamVLDmacroblocks, motion vectorsMPEG DecoderjoinMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayHere’s my algorithm, Where’s theconcurrency?
  • 88.Task decompositionIndependent coarse-grainedcomputationInherent to algorithmSequence of statements(instructions) that operatetogether as a groupCorresponds to some logicalpart of programUsually follows from the wayprogrammer thinks about aproblemjoinmotion vectorsspatially encoded macroblocksIDCTSaturationMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayHere’s my algorithm, Where’s theconcurrency?
  • 89.joinmotion vectorsSaturationspatially encoded macroblocksMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationIDCTMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatTask decompositionParallelism in the applicationData decompositionSame computation is appliedto small data chunks derivedfrom large data setHere’s my algorithm, Where’s theconcurrency?
  • 90.motion vectorsspatially encoded macroblocksMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationIDCTSaturationjoinMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatTask decompositionParallelism in the applicationData decompositionSame computation many dataPipeline decompositionData assembly linesProducer-consumer chainsHere’s my algorithm, Where’s theconcurrency?
  • 91.Algorithms start with a good understanding of theproblem being solvedPrograms often naturally decompose into tasksTwo common decompositions are––Function calls andDistinct loop iterationsEasier to start with many tasks and later fuse them,rather than too few tasks and later try to split themGuidelines for Task Decomposition
  • 92.FlexibilityProgram design should afford flexibility in the number andsize of tasks generated––Tasks should not tied to a specific architectureFixed tasks vs. Parameterized tasksEfficiencyTasks should have enough work to amortize the cost ofcreating and managing themTasks should be sufficiently independent so that managingdependencies doesn‟t become the bottleneckSimplicityThe code has to remain readable and easy to understand,and debugGuidelines for Task Decomposition
  • 93.Data decomposition is often implied by taskdecompositionProgrammers need to address task and datadecomposition to create a parallel programWhich decomposition to start with?Data decomposition is a good starting point whenMain computation is organized around manipulation of alarge data structureSimilar operations are applied to different parts of thedata structureGuidelines for Data DecompositionRaul Goycoolea S.Multiprocessor Programming 9316 February 2012
  • 94.Array data structuresDecomposition of arrays along rows, columns, blocksRecursive data structuresExample: decomposition of trees into sub-treesproblemcomputesubproblemcomputesubproblemcomputesubproblemcomputesubproblemmergesubproblemmergesubproblemmergesolutionsubproblemsplitsubproblemsplitsplitCommon Data DecompositionsRaul Goycoolea S.Multiprocessor Programming 9416 February 2012
  • 95.FlexibilitySize and number of data chunks should support a widerange of executionsEfficiencyData chunks should generate comparable amounts ofwork (for load balancing)SimplicityComplex data compositions can get difficult to manageand debugRaul Goycoolea S.Multiprocessor Programming 9516 February 2012Guidelines for Data Decompositions
  • 96.Data is flowing through a sequence of stagesAssembly line is a good analogyWhat’s a prime example of pipeline decomposition incomputer architecture?Instruction pipeline in modern CPUsWhat’s an example pipeline you may use in your UNIX shell?Pipes in UNIX: cat foobar.c | grep bar | wcOther examplesSignal processingGraphicsZigZagIQuantizationIDCTSaturationGuidelines for Pipeline DecompositionRaul Goycoolea S.Multiprocessor Programming 9616 February 2012
  • 97.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 9716 February 2012
  • 99.Coverage or extent of parallelism in algorithmAmdahl‟s LawGranularity of partitioning among processorsCommunication cost and load balancingLocality of computation and communicationCommunication between processors or betweenprocessors and their memoriesReview: Keys to Parallel Performance
  • 100.n/mBt overlap)C f (o lfrequencyof messagesoverhead permessage(at both ends)network delayper messagenumber of messagesamount of latencyhidden by concurrencywith computationtotal data sentcost induced bycontention permessagebandwidth along path(determined by network)Communication Cost Model
  • 101.synchronizationpointGet DataComputeGet DataCPU is idleMemory is idleComputeOverlapping Communication withComputation
  • 102.Computation to communication ratio limitsperformance gains from pipeliningGet DataComputeGet DataComputeWhere else to look for performance?Limits in Pipelining Communication
  • 103.Determined by program implementation andinteractions with the architectureExamples:Poor distribution of data across distributed memoriesUnnecessarily fetching data that is not usedRedundant data fetchesArtifactual Communication
  • 104.In uniprocessors, CPU communicates with memoryLoads and stores are to uniprocessors as_______ and ______ are to distributed memorymultiprocessorsHow is communication overlap enhanced inuniprocessors?Spatial localityTemporal locality“get” “put”Lessons From Uniprocessors
  • 105.CPU asks for data at address 1000Memory sends data at address 1000 … 1064Amount of data sent depends on architectureparameters such as the cache block sizeWorks well if CPU actually ends up using data from1001, 1002, …, 1064Otherwise wasted bandwidth and cache capacitySpatial Locality
  • 106.Main memory access is expensiveMemory hierarchy adds small but fast memories(caches) near the CPUMemories get bigger as distancefrom CPU increasesCPU asks for data at address 1000Memory hierarchy anticipates more accesses to sameaddress and stores a local copyWorks well if CPU actually ends up using data from 1000 overand over and over …Otherwise wasted cache capacitymainmemorycache(level 2)cache(level 1)Temporal Locality
  • 107.Data is transferred in chunks to amortizecommunication costCell: DMA gets up to 16KUsually get a contiguous chunk of memorySpatial localityComputation should exhibit good spatial localitycharacteristicsTemporal localityReorder computation to maximize use of data fetchedReducing Artifactual Costs inDistributed Memory Architectures
  • 108.Tasks mapped to execution units (threads)Threads run on individual processors (cores)finish line: sequential time + longest parallel timeTwo keys to faster executionLoad balance the work among the processorsMake execution on each processor fastersequentialparallelsequentialparallelSingle Thread Performance
  • 109.Need some way ofmeasuring performanceCoarse grainedmeasurements% gcc sample.c% time a.out2.312u 0.062s 0:02.50 94.8%% gcc sample.c –O3% time a.out1.921u 0.093s 0:02.03 99.0%… but did we learn muchabout what’s going on?#define N (1 << 23)#define T (10)#include <string.h>double a[N],b[N];void cleara(double a[N]) {int i;for (i = 0; i < N; i++) {a[i] = 0;}}int main() {double s=0,s2=0; int i,j;for (j = 0; j < T; j++) {for (i = 0; i < N; i++) {b[i] = 0;}cleara(a);memset(a,0,sizeof(a));for (i = 0; i < N; i++) {s += a[i] * b[i];s2 += a[i] * a[i] + b[i] * b[i];}}printf("s %f s2 %fn",s,s2);}record stop timerecord start timeUnderstanding Performance
  • 110.Increasingly possible to get accurate measurementsusing performance countersSpecial registers in the hardware to measure eventsInsert code to start, read, and stop counterMeasure exactly what you want, anywhere you wantCan measure communication and computation durationBut requires manual changesMonitoring nested scopes is an issueHeisenberg effect: counters can perturb execution timetimestopclear/startMeasurements Using Counters
  • 111.Event-based profilingInterrupt execution when an event counter reaches athresholdTime-based profilingInterrupt execution every t secondsWorks without modifying your codeDoes not require that you know where problem might beSupports multiple languages and programming modelsQuite efficient for appropriate sampling frequenciesDynamic Profiling
  • 112.Cycles (clock ticks)Pipeline stallsCache hitsCache missesNumber of instructionsNumber of loadsNumber of storesNumber of floating point operations…Counter Examples
  • 113.Processor utilizationCycles / Wall Clock TimeInstructions per cycleInstructions / CyclesInstructions per memory operationInstructions / Loads + StoresAverage number of instructions per load missInstructions / L1 Load MissesMemory trafficLoads + Stores * Lk Cache Line SizeBandwidth consumedLoads + Stores * Lk Cache Line Size / Wall Clock TimeMany othersCache miss rateBranch misprediction rate…Useful Derived Measurements
  • 115.GNU gprofWidely available with UNIX/Linux distributionsgcc –O2 –pg foo.c –o foo./foogprof fooHPC Toolkithttp://www.hipersoft.rice.edu/hpctoolkit/PAPIhttp://icl.cs.utk.edu/papi/VTunehttp://www.intel.com/cd/software/products/asmo-na/eng/vtune/Many othersPopular Runtime Profiling Tools
  • 116.Instruction level parallelismMultiple functional units, deeply pipelined, speculation, ...Data level parallelismSIMD (Single Inst, Multiple Data): short vector instructions(multimedia extensions)–––Hardware is simpler, no heavily ported register filesInstructions are more compactReduces instruction fetch bandwidthComplex memory hierarchiesMultiple level caches, may outstanding misses,prefetching, …Performance un Uniprocessorstime = compute + wait
  • 117.Single Instruction, Multiple DataSIMD registers hold short vectorsInstruction operates on all elements in SIMD register at onceabcVector codefor (int i = 0; i < n; i += 4) {c[i:i+3] = a[i:i+3] + b[i:i+3]}SIMD registerScalar codefor (int i = 0; i < n; i+=1) {c[i] = a[i] + b[i]}abcscalar registerSingle Instruction, Multiple Data
  • 118.For Example CellSPU has 128 128-bit registersAll instructions are SIMD instructionsRegisters are treated as short vectors of 8/16/32-bitintegers or single/double-precision floatsInstruction SetAltiVecMMX/SSE3DNow!VISMAX2MVIMDMXArchitecturePowerPCIntelAMDSunHPAlphaMIPS VSIMD Width12864/1286464646464Floating PointyesyesyesnononoyesSIMD in Major Instruction SetArchitectures (ISAs)
  • 119.Library calls and inline assemblyDifficult to programNot portableDifferent extensions to the same ISAMMX and SSESSE vs. 3DNow!Compiler vs. Crypto Oracle T4Using SIMD Instructions
  • 120.Tune the parallelism firstThen tune performance on individual processorsModern processors are complexNeed instruction level parallelism for performanceUnderstanding performance requires a lot of probingOptimize for the memory hierarchyMemory is much slower than processorsMulti-layer memory hierarchies try to hide the speed gapData locality is essential for performanceProgramming for Performance
• 121. Programming for Performance (continued)
You may have to change everything: algorithms, data structures, program structure. Focus on the biggest performance impediments; there are too many issues to study everything, and remember the law of diminishing returns.
• 122. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Raul Goycoolea S. Multiprocessor Programming 122 16 February 2012
• 124. Compilers Outline: Parallel Execution, Parallelizing Compilers, Dependence Analysis, Increasing Parallelization Opportunities, Generation of Parallel Loops, Communication Code Generation. Raul Goycoolea S. Multiprocessor Programming 124 16 February 2012
• 125. Types of Parallelism
    Instruction Level Parallelism (ILP)               - Scheduling and Hardware
    Task Level Parallelism (TLP)                      - Mainly by hand
    Loop Level Parallelism (LLP) or Data Parallelism  - Hand or Compiler Generated
    Pipeline Parallelism                              - Hardware or Streaming
    Divide and Conquer Parallelism                    - Recursive functions
Raul Goycoolea S. Multiprocessor Programming 125 16 February 2012
• 126. Why Loops?
90% of the execution time is in 10% of the code, mostly in loops. If a loop is parallel, we can get good performance and good load balancing, and loops are relatively easy to analyze. Raul Goycoolea S. Multiprocessor Programming 126 16 February 2012
• 127. Programmer Defined Parallel Loop
FORALL: no "loop carried dependences", fully parallel.
FORACROSS: some "loop carried dependences".
Raul Goycoolea S. Multiprocessor Programming 127 16 February 2012
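One concrete way to write a FORALL-style loop is OpenMP's parallel for; this is a minimal sketch under that assumption (the deck does not prescribe a particular API):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double A[N], B[N];

        /* FORALL: every iteration is independent, so the runtime may
           run them in any order, on any number of threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = B[i] * 2.0 + 1.0;

        printf("A[0] = %f\n", A[0]);
        return 0;
    }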
• 128. Outline: Parallel Execution, Parallelizing Compilers, Dependence Analysis, Increasing Parallelization Opportunities, Generation of Parallel Loops, Communication Code Generation. Raul Goycoolea S. Multiprocessor Programming 128 16 February 2012
• 129. Parallelizing Compilers
The goal is finding FORALL loops among ordinary FOR loops. Examples:
    FOR I = 0 to 5
      A[I+1] = A[I] + 1
    FOR I = 0 to 5
      A[I] = A[I+6] + 1
    FOR I = 0 to 5
      A[2*I] = A[2*I + 1] + 1
(The first loop carries a dependence from iteration I to iteration I+1, so it is not parallel; in the second and third, no iteration reads a location another iteration writes, so they are FORALL loops.)
Raul Goycoolea S. Multiprocessor Programming 129 16 February 2012
• 130. Dependences
True dependence:   a = ...   followed by   ... = a
Anti dependence:   ... = a   followed by   a = ...
Output dependence: a = ...   followed by   a = ...
Definition: a data dependence exists between dynamic instances i and j iff either i or j is a write operation, i and j refer to the same variable, and i executes before j.
How about array accesses within loops?
Raul Goycoolea S. Multiprocessor Programming 130 16 February 2012
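A tiny straight-line example of the three dependence kinds (illustrative code, not from the deck):

    /* Illustrating the three dependence types between statements S1..S4. */
    void deps(void) {
        int a, b, c, t;

        a = 1;       /* S1: write a                                     */
        b = a + 1;   /* S2: read a  -> true (flow) dependence S1 -> S2  */
        c = b;       /* S3: read b                                      */
        b = 7;       /* S4: write b -> anti dependence S3 -> S4,        */
                     /*                output dependence S2 -> S4       */
        t = c + b;   /* keep the values live so the compiler keeps them */
        (void)t;
    }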
• 131. Outline: Parallel Execution, Parallelizing Compilers, Dependence Analysis, Increasing Parallelization Opportunities, Generation of Parallel Loops, Communication Code Generation. Raul Goycoolea S. Multiprocessor Programming 131 16 February 2012
• 132. Array Access in a Loop
    FOR I = 0 to 5
      A[I] = A[I] + 1
[Diagram: the iteration space (I = 0 ... 5) mapped onto the data space (the elements of A); each iteration I reads and writes only A[I], so no two iterations touch the same element.]
Raul Goycoolea S. Multiprocessor Programming 132 16 February 2012
• 133. Recognizing FORALL Loops
Find the data dependences in the loop. For every pair of array accesses to the same array: if the first access has at least one dynamic instance (an iteration) in which it refers to a location in the array that the second access also refers to in at least one of the later dynamic instances (iterations), then there is a data dependence between the statements. (Note that an access can depend on itself: output dependences.)
Definition: a loop-carried dependence is a dependence that crosses a loop boundary.
If there are no loop-carried dependences, the loop is parallelizable.
Raul Goycoolea S. Multiprocessor Programming 133 16 February 2012
• 134. What is the Dependence?
    FOR I = 1 to n
      FOR J = 1 to n
        A[I, J] = A[I-1, J+1] + 1

    FOR I = 1 to n
      FOR J = 1 to n
        A[I] = A[I-1] + 1
[Diagram: the dependences plotted in the (I, J) iteration space for each loop nest.]
(In the first nest the dependence goes from iteration (I-1, J+1) to (I, J), so it is carried by the outer I loop and the inner J loop is parallel. In the second nest the dependence A[I-1] to A[I] is carried by the I loop, and every J iteration for a given I writes the same element A[I].)
Raul Goycoolea S. Multiprocessor Programming 134 16 February 2012
• 135. Outline: Parallel Execution, Parallelizing Compilers, Dependence Analysis, Increasing Parallelization Opportunities, Generation of Parallel Loops, Communication Code Generation. Raul Goycoolea S. Multiprocessor Programming 135 16 February 2012
• 136. Increasing Parallelization Opportunities
Scalar Privatization
Reduction Recognition
Induction Variable Identification
Array Privatization
Interprocedural Parallelization
Loop Transformations
Granularity of Parallelism
Raul Goycoolea S. Multiprocessor Programming 136 16 February 2012
• 137. Scalar Privatization
Example:
    FOR i = 1 to n
      X = A[i] * 3;
      B[i] = X;
Is there a loop-carried dependence? What is the type of dependence? (Every iteration writes the shared scalar X, so there are loop-carried anti and output dependences on X; they disappear if each iteration gets its own private copy of X.)
Raul Goycoolea S. Multiprocessor Programming 137 16 February 2012
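A minimal sketch of scalar privatization using OpenMP's private clause (an assumption of convenience; the deck does not mandate OpenMP):

    #include <omp.h>

    void scale(const double *A, double *B, int n) {
        double X;

        /* Each thread gets its own copy of X, so the loop-carried
           anti/output dependences on X disappear. */
        #pragma omp parallel for private(X)
        for (int i = 0; i < n; i++) {
            X = A[i] * 3.0;
            B[i] = X;
        }
    }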
• 138. Reduction Recognition
Reduction analysis applies when only associative operations are used and the result is never used within the loop. Transformation (per-processor partial sums):
    Integer Xtmp[NUMPROC];
    Barrier();
    FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)
      Xtmp[myPid] = Xtmp[myPid] + A[i];
    Barrier();
    If (myPid == 0) {
      FOR p = 0 to NUMPROC-1
        X = X + Xtmp[p];
      ...
    }
Raul Goycoolea S. Multiprocessor Programming 138 16 February 2012
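The same pattern in compact form, as a hedged OpenMP sketch: the reduction clause makes the runtime create the per-thread partial sums and combine them, much like the Xtmp array above.

    #include <omp.h>

    double sum(const double *A, int n) {
        double X = 0.0;

        /* Each thread accumulates into a private copy of X; the copies
           are added together when the parallel loop finishes. */
        #pragma omp parallel for reduction(+:X)
        for (int i = 0; i < n; i++)
            X += A[i];

        return X;
    }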
• 139. Induction Variables
Example:
    FOR i = 0 to N
      A[i] = 2^i;
After strength reduction:
    t = 1
    FOR i = 0 to N
      A[i] = t;
      t = t*2;
What happened to the loop-carried dependences? Strength reduction introduced one: each iteration now depends on the previous value of t. To parallelize, we need to do the opposite: perform induction variable analysis and rewrite induction variables as a function of the loop variable.
Raul Goycoolea S. Multiprocessor Programming 139 16 February 2012
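A small illustrative sketch (not from the deck) of rewriting the induction variable as a function of the loop variable, which makes the loop a FORALL again; ldexp(1.0, i) computes 2^i without carrying state across iterations:

    #include <math.h>
    #include <omp.h>

    void powers_of_two(double *A, int n) {
        /* t has been eliminated: each iteration computes 2^i directly,
           so there is no loop-carried dependence left. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            A[i] = ldexp(1.0, i);   /* 1.0 * 2^i */
    }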
• 140. Array Privatization
Similar to scalar privatization, but the analysis is more complex.
Array data dependence analysis checks whether two iterations access the same location; array data flow analysis checks whether two iterations access the same value.
The transformations are similar to scalar privatization: give each processor a private copy of the array, or expand the array with an additional dimension.
Raul Goycoolea S. Multiprocessor Programming 140 16 February 2012
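A hedged sketch of the per-processor copy idea (illustrative only): a temporary array that is written and read entirely within one iteration can safely get one copy per thread.

    #include <omp.h>

    #define N 1024
    #define M 64

    void rowsum(double out[N], double in[N][M]) {
        #pragma omp parallel
        {
            /* One private copy of the temporary per thread: this replaces
               what would otherwise be a shared tmp[M]. */
            double tmp[M];

            #pragma omp for
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < M; j++)
                    tmp[j] = in[i][j] * 2.0;
                double s = 0;
                for (int j = 0; j < M; j++)
                    s += tmp[j];
                out[i] = s;
            }
        }
    }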
• 141. Interprocedural Parallelization
Function calls can make a loop unparallelizable, reducing the available parallelism; much of what remains is inner-loop parallelism.
Solutions: interprocedural analysis, or inlining.
Raul Goycoolea S. Multiprocessor Programming 141 16 February 2012
• 142. Communication Code Generation
Cache-coherent shared memory machine: generate code for the parallel loop nest.
No cache-coherent shared memory, or distributed memory machines: generate code for the parallel loop nest, identify the communication, and generate the communication code.
Raul Goycoolea S. Multiprocessor Programming 142 16 February 2012
• 143. Communication Optimizations
Eliminating redundant communication
Communication aggregation
Multi-cast identification
Local memory management
Raul Goycoolea S. Multiprocessor Programming 143 16 February 2012
• 144. Summary
Automatic parallelization of loops with arrays requires data dependence analysis. With the iteration space and data space abstractions this becomes an integer programming problem. Many optimizations can increase parallelism, and transforming loop nests and generating communication code fit into the same framework: Fourier-Motzkin elimination provides a nice foundation.
Raul Goycoolea S. Multiprocessor Programming 144 16 February 2012
• 145. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Future of Parallel Architectures
Raul Goycoolea S. Multiprocessor Programming 145 16 February 2012
• 147. Predicting the Future is Always Risky
"I think there is a world market for maybe five computers."
  – Thomas Watson, chairman of IBM, 1949
"There is no reason in the world anyone would want a computer in their home. No reason."
  – Ken Olsen, Chairman, DEC, 1977
"640K of RAM ought to be enough for anybody."
  – Bill Gates, 1981
Raul Goycoolea S. Multiprocessor Programming 147 16 February 2012
• 148. Future = Evolution + Revolution
Evolution is relatively easy to predict: extrapolate the trends.
Revolution, a completely new technology or solution, is hard to predict.
Paradigm shifts can occur in both.
Raul Goycoolea S. Multiprocessor Programming 148 16 February 2012
• 149. Outline
Evolution: trends; architecture; languages, compilers and tools.
Revolution: crossing the abstraction boundaries.
Raul Goycoolea S. Multiprocessor Programming 149 16 February 2012
• 150. Evolution
Look at the trends: Moore's Law, power consumption, wire delay, hardware complexity, parallelizing compilers, program design methodologies. The design drivers are different in different generations.
Raul Goycoolea S. Multiprocessor Programming 150 16 February 2012
• 151. The Road to Multicore: Moore's Law
[Chart: processor performance (relative to the VAX-11/780) and transistor counts, 1978-2016, for the 8086, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2; transistor counts climb from tens of thousands toward a billion, with performance growth-rate annotations of 25%/year and 52%/year marking different eras. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
Raul Goycoolea S. Multiprocessor Programming 151 16 February 2012
• 152. The Road to Multicore: Uniprocessor Performance (SPECint)
[Chart: SPECint2000 performance, 1985-2007, for Intel (386, 486, Pentium through Pentium 4, Itanium), Alpha (21064, 21164, 21264), Sparc/SuperSparc/Sparc64, MIPS, HP PA, PowerPC and AMD (K6, K7, x86-64) processors.]
Raul Goycoolea S. Multiprocessor Programming 152 16 February 2012
• 153. The Road to Multicore: Uniprocessor Performance (SPECint)
General-purpose unicores have stopped their historic performance scaling, due to power consumption, wire delays, DRAM access latency, and the diminishing returns of more instruction-level parallelism.
Raul Goycoolea S. Multiprocessor Programming 153 16 February 2012
• 154. Power Consumption (watts)
[Chart: power consumption in watts, 1985-2007, for Intel (386 through Pentium 4 and Itanium), Alpha, Sparc, MIPS, HP PA, PowerPC and AMD processors.]
Raul Goycoolea S. Multiprocessor Programming 154 16 February 2012
• 155. Power Efficiency (watts/spec)
[Chart: watts per SPEC, 1982-2006, for the same set of Intel, Alpha, Sparc, MIPS, HP PA, PowerPC and AMD processors.]
Raul Goycoolea S. Multiprocessor Programming 155 16 February 2012
• 156. Range of a Wire in One Clock Cycle
[Chart: for process technologies from 0.25 microns down to about 0.06 microns (1996-2014) and clock rates from 700 MHz to 13.5 GHz, the fraction of a 400 mm2 die reachable by a wire in one clock cycle. From the SIA Roadmap.]
Raul Goycoolea S. Multiprocessor Programming 156 16 February 2012
• 157. DRAM Access Latency
[Chart: processor performance improving at about 60%/year (2x every 1.5 years) versus DRAM at about 9%/year (2x every 10 years), 1980-2004.]
Access times are a speed-of-light issue, and memory technology is also changing: SRAM is getting harder to scale, and DRAM is no longer the cheapest cost/bit. Power efficiency is an issue here as well.
Raul Goycoolea S. Multiprocessor Programming 157 16 February 2012
• 158. CPUs Architecture: Heat Becoming an Unmanageable Problem
[Chart: power density (W/cm2) from the 4004 and 8008 through the 8080, 8086, 286, 386, 486 and Pentium, heading past "hot plate" toward nuclear-reactor, rocket-nozzle and Sun's-surface levels. Source: Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W).]
There is a cube relationship between cycle time and power.
Raul Goycoolea S. Multiprocessor Programming 158 16 February 2012
• 159. Improvement in Automatic Parallelization
[Timeline, 1970-2010: automatic parallelizing compilers for FORTRAN and vectorization technology; then compiling for instruction-level parallelism; a dip with the prevalence of type-unsafe languages and complex data structures (C, C++); renewed progress with type-safe languages (Java, C#); and, going forward, demand driven by multicores?]
Raul Goycoolea S. Multiprocessor Programming 159 16 February 2012
• 160. Multicores Future
[Chart: number of cores per chip, 1970-2010, on a scale from 1 to 512. Single-core parts run from the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PExtreme, Power6, Yonah, PA-8800, Opteron, Tanglewood and Xeon MP; multicore parts such as Niagara, Cell, Raw, Xbox360, Opteron 4P, Broadcom 1480, Cisco CSR-1, Intel Tflops, Raza XLR, Cavium Octeon, PicoChip PC102 and Ambric AM2045 push core counts toward the hundreds.]
Raul Goycoolea S. Multiprocessor Programming 160 16 February 2012
• 161. Outline
Evolution: trends; architecture; languages, compilers and tools.
Revolution: crossing the abstraction boundaries.
Raul Goycoolea S. Multiprocessor Programming 161 16 February 2012
• 162. Novel Opportunities in Multicores
We don't have to contend with uniprocessors: the era of Moore's Law induced performance gains is over. Parallel programming will be required by the masses, not just a few supercomputer super-users.
Raul Goycoolea S. Multiprocessor Programming 162 16 February 2012
• 163. Novel Opportunities in Multicores (continued)
This is also not your same old multiprocessor problem. How does going from multiprocessors to multicores impact programs? What changed? Where is the impact? In communication bandwidth and communication latency.
Raul Goycoolea S. Multiprocessor Programming 163 16 February 2012
• 164. Communication Bandwidth
How much data can be communicated between two cores? Roughly 32 gigabits/sec off chip versus about 300 terabits/sec on chip: about 10,000x.
What changed? The number of wires (IO is the true bottleneck, while on-chip wire density is very high), the clock rate (IO is slower than on-chip), and multiplexing (no sharing of pins on chip).
Impact on the programming model: massive data exchange is possible, data movement is not the bottleneck, and processor affinity is not that important.
Raul Goycoolea S. Multiprocessor Programming 164 16 February 2012
• 165. Communication Latency
How long does a round-trip communication take? Roughly 200 cycles off chip versus about 4 cycles on chip: about 50x.
What changed? The length of the wire (very short wires are faster) and the pipeline stages (no multiplexing, on-chip is much closer, bypass and speculation become possible).
Impact on the programming model: ultra-fast synchronization, and real-time applications can run across multiple cores.
Raul Goycoolea S. Multiprocessor Programming 165 16 February 2012
• 166. Past, Present and the Future?
[Diagram: a traditional multiprocessor (each processing element with its own cache and memory), a basic multicore such as IBM Power (processing elements and caches sharing memory on one chip), and an integrated multicore such as the 8-core, 8-thread Oracle T4.]
Raul Goycoolea S. Multiprocessor Programming 166 16 February 2012
• 167. Summary
• As technology evolves, the inherent flexibility of multiprocessors adapts to new requirements
• Processors can be used at any time for many kinds of applications
• Optimization adapts processors to high performance requirements
Raul Goycoolea S. Multiprocessor Programming 167 16 February 2012
• 168. References
• Author: Raul Goycoolea, Oracle Corporation.
• A search on the WWW for "parallel programming" or "parallel computing" will yield a wide variety of information.
• Recommended reading:
  • "Designing and Building Parallel Programs", Ian Foster. http://www-unix.mcs.anl.gov/dbpp/
  • "Introduction to Parallel Computing", Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/
  • "Overview of Recent Supercomputers", A.J. van der Steen, Jack Dongarra. www.phys.uu.nl/~steen/web03/overview.html
• MIT Multicore Programming Class: 6.189, Prof. Saman Amarasinghe.
• Photos/Graphics have been created by the author, obtained from non-copyrighted, government or public domain (such as http://commons.wikimedia.org/) sources, or used with the permission of authors from other presentations and web pages.
• 169. Keep in Touch
Raul Goycoolea Seoane
Twitter: http://twitter.com/raul_goycoolea
Facebook: http://www.facebook.com/raul.goycoolea
Linkedin: http://www.linkedin.com/in/raulgoy
Blog: http://blogs.oracle.com/raulgoy/
Raul Goycoolea S. Multiprocessor Programming 169 16 February 2012
