Multiprocessor architecture and programming

Raul Goycoolea Seoane

The document discusses the evolution and challenges of parallel computing, outlining historical software crises and how different programming paradigms emerged to address these challenges. It also describes the principles of parallel computing, including architectures and classifications like Flynn's taxonomy, highlighting the shift from serial to parallel computing. Key concepts include the necessity for performance optimization, the transition to multicore processors, and an understanding of various parallel programming techniques.

Parallel Computing Architecture & Programming Techniques
Raul Goycoolea S.
Solution Architect Manager, Oracle Enterprise Architecture Group
Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Antecedents ofParallelComputing
The “Software Crisis”“To put it quite bluntly: as long as there were nomachines, programming was no problem at all; whenwe had a few weak computers, programming became amild problem, and now we have gigantic computers,programming has become an equally gigantic problem."-- E. Dijkstra, 1972 Turing Award LectureRaul Goycoolea S.Multiprocessor Programming 416 February 2012
The First Software Crisis• Time Frame: ’60s and ’70s• Problem: Assembly Language ProgrammingComputers could handle larger more complex programs• Needed to get Abstraction and Portability withoutlosing PerformanceRaul Goycoolea S.Multiprocessor Programming 516 February 2012
How Did We Solve The First Software Crisis?
• High-level languages for von Neumann machines: FORTRAN and C
• Provided a “common machine language” for uniprocessors
Common properties: single flow of control, single memory image
Differences: register file, ISA, functional units
The Second Software Crisis• Time Frame: ’80s and ’90s• Problem: Inability to build and maintain complex androbust applications requiring multi-million lines ofcode developed by hundreds of programmersComputers could handle larger more complex programs• Needed to get Composability, Malleability andMaintainabilityHigh-performance was not an issue left for Moore’s LawRaul Goycoolea S.Multiprocessor Programming 716 February 2012
How Did We Solve the SecondSoftware Crisis?• Object Oriented ProgrammingC++, C# and Java• Also…Better tools• Component libraries, PurifyBetter software engineering methodology• Design patterns, specification, testing, codereviewsRaul Goycoolea S.Multiprocessor Programming 816 February 2012
Today: Programmers are Oblivious to Processors
• Solid boundary between hardware and software
• Programmers don’t have to know anything about the processor
High-level languages abstract away the processors; for example, Java bytecode is machine independent
Moore’s law does not require the programmers to know anything about the processors to get good speedups
• Programs are oblivious of the processor and work on all processors
A program written in the ’70s using C still works, and is much faster, today
• This abstraction provides a lot of freedom for the programmers
The Origins of a Third Crisis• Time Frame: 2005 to 20??• Problem: Sequential performance is left behind byMoore’s law• Needed continuous and reasonable performanceimprovementsto support new featuresto support larger datasets• While sustaining portability, malleability andmaintainability without unduly increasing complexityfaced by the programmer critical to keep-up with thecurrent rate of evolution in softwareRaul Goycoolea S.Multiprocessor Programming 1016 February 2012
The Road to Multicore: Moore’s Law
[Chart: number of transistors (10,000 to 1,000,000,000) and performance relative to the VAX-11/780 for processors from the 8086 through the 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2, 1978 to 2016; performance growth rates of roughly 25% and 52% per year in different eras. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
The Road to Multicore: Uniprocessor Performance (SPECint)
[Chart: SPECint2000 performance, 1985 to 2007, for Intel 386/486/Pentium/Pentium 2/Pentium 3/Pentium 4/Itanium, Alpha 21064/21164/21264, Sparc, SuperSparc, Sparc64, MIPS, HP PA, PowerPC, AMD K6/K7/x86-64.]
The Road to Multicore: Uniprocessor Performance (SPECint)
General-purpose unicores have stopped historic performance scaling:
• Power consumption
• Wire delays
• DRAM access latency
• Diminishing returns of more instruction-level parallelism
Power Consumption (watts)
[Chart: power consumption in watts, 1985 to 2007, for the same processor families (Intel 386 through Itanium, Alpha, Sparc, MIPS, HP PA, PowerPC, AMD); power rises by roughly two orders of magnitude over the period.]
Power Efficiency (watts/spec)
[Chart: watts per SPEC, 1982 to 2006, for the same processor families; efficiency worsens sharply for the latest unicore generations.]
Range of a Wire in One Clock Cycle
[Chart: process size in microns (0.26 down to 0.06) versus year (1996 to 2014), with clock rates from 700 MHz to 13.5 GHz; assumes a 400 mm2 die. From the SIA Roadmap.]
DRAM Access Latency
[Chart: processor performance improves about 60% per year (2x every 1.5 years) while DRAM improves about 9% per year (2x every 10 years), 1980 to 2004.]
• Access times are a speed-of-light issue
• Memory technology is also changing: SRAM is getting harder to scale; DRAM is no longer the cheapest cost/bit
• Power efficiency is an issue here as well
CPU Architecture: Heat Becoming an Unmanageable Problem
[Chart: power density (W/cm2) from the 4004 through the 8008, 8080, 8086, 286, 386, 486 and Pentium, approaching that of a hot plate and extrapolating toward a nuclear reactor, rocket nozzle and the Sun’s surface. Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W).]
Cube relationship between the cycle time and power.
Diminishing Returns• The ’80s: Superscalar expansion50% per year improvement in performanceTransistors applied to implicit parallelism- pipeline processor (10 CPI --> 1 CPI)• The ’90s: The Era of Diminishing ReturnsSqueaking out the last implicit parallelism2-way to 6-way issue, out-of-order issue, branch prediction1 CPI --> 0.5 CPIPerformance below expectations projects delayed & canceled• The ’00s: The Beginning of the Multicore EraThe need for Explicit ParallelismRaul Goycoolea S.Multiprocessor Programming 1916 February 2012
Unicores are going extinct: now everything is multicore
[Timeline, 2H 2004 to 2H 2006: MIT Raw (16 cores, 2002); IBM Power 4 and 5 (dual cores since 2001); Intel Pentium D (Smithfield); cancelled Intel Tejas & Jayhawk (unicore 4 GHz P4); Intel Pentium Extreme (3.2 GHz dual core); AMD Opteron (dual core); Intel Yonah (dual-core mobile); Intel Dempsey (dual-core Xeon); Intel Montecito (dual-core IA-64, 1.7 billion transistors); Intel Tanglewood (dual-core IA-64); IBM Power 6 (dual core); Sun Olympus and Niagara (8 processor cores); IBM Cell (scalable multicore).]
Multicores Future
[Chart: number of cores per chip versus year, 1970 to 2010, from unicores (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PA-8800, Opteron, PExtreme, Power6, Yonah, Tanglewood) to multicores with 2 to 512 cores (Opteron 4P, Xeon MP, Cell, Raw, Niagara, Xbox360, Broadcom 1480, Cisco CSR-1, Picochip PC102, Cavium Octeon, Raza XLR, Intel Tflops, Ambric AM2045).]
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 2216 February 2012
Introduction toParallelArchitectures
Traditionally, software has been written for serial computation:• To be run on a single computer having a single Central Processing Unit (CPU)• A problem is broken into a discrete series of instructions• Instructions are executed one after another• Only one instruction may execute at any moment in timeWhat is Parallel Computing?Raul Goycoolea S.Multiprocessor Programming 2416 February 2012
What is Parallel Computing?In the simplest sense, parallel computing is the simultaneous use of multiplecompute resources to solve a computational problem:• To be run using multiple CPUs• A problem is broken into discrete parts that can be solved concurrently• Each part is further broken down to a series of instructions• Instructions from each part execute simultaneously on different CPUsRaul Goycoolea S.Multiprocessor Programming 2516 February 2012
Options in Parallel Computing?The compute resources might be:• A single computer with multiple processors;• An arbitrary number of computers connected by a network;• A combination of both.The computational problem should be able to:• Be broken apart into discrete pieces of work that can be solvedsimultaneously;• Execute multiple program instructions at any moment in time;• Be solved in less time with multiple compute resources than with asingle compute resource.Raul Goycoolea S.Multiprocessor Programming 2616 February 2012
The Real World is Massively Parallel• Parallel computing is an evolution of serial computing thatattempts to emulate what has always been the state ofaffairs in the natural world: many complex, interrelatedevents happening at the same time, yet within a sequence.For example:• Galaxy formation• Planetary movement• Weather and ocean patterns• Tectonic plate drift Rush hour traffic• Automobile assembly line• Building a jet• Ordering a hamburger at the drive through.Raul Goycoolea S.Multiprocessor Programming 2816 February 2012
Architecture Concepts: Von Neumann Architecture
• Named after the Hungarian mathematician John von Neumann, who first authored the general requirements for an electronic computer in his 1945 papers
• Since then, virtually all computers have followed this basic design, differing from earlier computers which were programmed through "hard wiring"
• Comprised of four main components: Memory, Control Unit, Arithmetic Logic Unit, Input/Output
• Read/write, random-access memory is used to store both program instructions and data
• Program instructions are coded data which tell the computer to do something
• Data is simply information to be used by the program
• The control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task
• The arithmetic logic unit performs basic arithmetic operations
• Input/Output is the interface to the human operator
Flynn’s Taxonomy• There are different ways to classify parallel computers. One of the morewidely used classifications, in use since 1966, is called Flynn'sTaxonomy.• Flynn's taxonomy distinguishes multi-processor computer architecturesaccording to how they can be classified along the two independentdimensions of Instruction and Data. Each of these dimensions canhave only one of two possible states: Single or Multiple.• The matrix below defines the 4 possible classifications according toFlynn:Raul Goycoolea S.Multiprocessor Programming 3016 February 2012
Single Instruction, Single Data (SISD):• A serial (non-parallel) computer• Single Instruction: Only one instruction stream isbeing acted on by the CPU during any one clockcycle• Single Data: Only one data stream is being usedas input during any one clock cycle• Deterministic execution• This is the oldest and even today, the mostcommon type of computer• Examples: older generation mainframes,minicomputers and workstations; most modernday PCs.Raul Goycoolea S.Multiprocessor Programming 3116 February 2012
Single Instruction, Single Data (SISD):Raul Goycoolea S.Multiprocessor Programming 3216 February 2012
Single Instruction, Multiple Data(SIMD):• A type of parallel computer• Single Instruction: All processing units execute the same instruction at anygiven clock cycle• Multiple Data: Each processing unit can operate on a different data element• Best suited for specialized problems characterized by a high degree ofregularity, such as graphics/image processing.• Synchronous (lockstep) and deterministic execution• Two varieties: Processor Arrays and Vector Pipelines• Examples:• Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,ETA10• Most modern computers, particularly those with graphics processor units(GPUs) employ SIMD instructions and execution units.Raul Goycoolea S.Multiprocessor Programming 3316 February 2012
Single Instruction, Multiple Data(SIMD):ILLIAC IV MasPar TM CM-2 Cell GPUCray X-MP Cray Y-MPRaul Goycoolea S.Multiprocessor Programming 3416 February 2012
• A type of parallel computer• Multiple Instruction: Each processing unit operates on the dataindependently via separate instruction streams.• Single Data: A single data stream is fed into multiple processingunits.• Few actual examples of this class of parallel computer have everexisted. One is the experimental Carnegie-Mellon C.mmp computer(1971).• Some conceivable uses might be:• multiple frequency filters operating on a single signal stream• multiple cryptography algorithms attempting to crack a single codedmessage.Multiple Instruction, Single Data(MISD):Raul Goycoolea S.Multiprocessor Programming 3516 February 2012
Multiple Instruction, Single Data(MISD):Raul Goycoolea S.Multiprocessor Programming 3616 February 2012
• A type of parallel computer• Multiple Instruction: Every processor may be executing a differentinstruction stream• Multiple Data: Every processor may be working with a differentdata stream• Execution can be synchronous or asynchronous, deterministic ornon-deterministic• Currently, the most common type of parallel computer - mostmodern supercomputers fall into this category.• Examples: most current supercomputers, networked parallelcomputer clusters and "grids", multi-processor SMP computers,multi-core PCs.Note: many MIMD architectures also include SIMD execution sub-componentsMultiple Instruction, Multiple Data(MIMD):Raul Goycoolea S.Multiprocessor Programming 3716 February 2012
Multiple Instruction, Multiple Data(MIMD):Raul Goycoolea S.Multiprocessor Programming 3816 February 2012
Multiple Instruction, Multiple Data(MIMD):IBM Power HP Alphaserver Intel IA32/x64Oracle SPARC Cray XT3 Oracle Exadata/ExalogicRaul Goycoolea S.Multiprocessor Programming 3916 February 2012
Parallel Computer Memory ArchitectureShared MemoryShared memory parallel computers vary widely, but generally have in common theability for all processors to access all memory as global address space.Multiple processors can operate independently but share the same memoryresources.Changes in a memory location effected by one processor are visible to all otherprocessors.Shared memory machines can be divided into two main classes based uponmemory access times: UMA and NUMA.Uniform Memory Access (UMA):• Most commonly represented today by Symmetric Multiprocessor (SMP) machines• Identical processorsNon-Uniform Memory Access (NUMA):• Often made by physically linking two or more SMPs• One SMP can directly access memory of another SMP40Raul Goycoolea S.Multiprocessor Programming 4016 February 2012
Parallel Computer Memory ArchitectureShared Memory41Shared Memory (UMA) Shared Memory (NUMA)Raul Goycoolea S.Multiprocessor Programming 4116 February 2012
Basic structure of a centralized shared-memory multiprocessor
[Diagram: four processors, each with one or more levels of cache, sharing one physical memory.]
Multiple processor-cache subsystems share the same physical memory, typically connected by a bus. In larger designs, multiple buses, or even a switch, may be used, but the key architectural property remains: uniform access time to all memory from all processors.
Processor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryInterconnection NetworkBasic Architecture of a DistributedMultiprocessorConsists of individual nodes containing a processor, some memory, typically some I/O, and an interface to aninterconnection network that connects all the nodes. Individual nodes may contain a small number ofprocessors, which may be interconnected by a small bus or a different interconnection technology, which is lessscalable than the global interconnection network.Raul Goycoolea S.Multiprocessor Programming 4316 February 2012
Communicationhow do parallel operations communicate data results?Synchronizationhow are parallel operations coordinated?Resource Managementhow are a large number of parallel tasks scheduled ontofinite hardware?Scalabilityhow large a machine can be built?Issues in Parallel Machine DesignRaul Goycoolea S.Multiprocessor Programming 4416 February 2012
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 4516 February 2012
ParallelProgrammingConcepts
ExplicitImplicitHardware CompilerSuperscalarProcessorsExplicitly Parallel ArchitecturesImplicit vs. Explicit ParallelismRaul Goycoolea S.Multiprocessor Programming 4716 February 2012
Implicit Parallelism: Superscalar ProcessorsExplicit ParallelismShared Instruction ProcessorsShared Sequencer ProcessorsShared Network ProcessorsShared Memory ProcessorsMulticore ProcessorsOutlineRaul Goycoolea S.Multiprocessor Programming 4816 February 2012
Issue varying numbers of instructions per clockstatically scheduled––using compiler techniquesin-order executiondynamically scheduled–––––Extracting ILP by examining 100‟s of instructionsScheduling them in parallel as operands become availableRename registers to eliminate anti dependencesout-of-order executionSpeculative executionImplicit Parallelism: SuperscalarProcessorsRaul Goycoolea S.Multiprocessor Programming 4916 February 2012
Instruction i IF ID EX WBIF ID EX WBIF ID EX WBIF ID EX WBIF ID EX WBInstruction i+1Instruction i+2Instruction i+3Instruction i+4Instruction # 1 2 3 4 5 6 7 8IF: Instruction fetchEX : ExecutionCyclesID : Instruction decodeWB : Write backPipelining ExecutionRaul Goycoolea S.Multiprocessor Programming 5016 February 2012
Instruction type 1 2 3 4 5 6 7CyclesIntegerFloating pointIFIFIDIDEXEXWBWBIntegerFloating pointIntegerFloating pointIntegerFloating pointIFIFIDIDEXEXWBWBIFIFIDIDEXEXWBWBIFIFIDIDEXEXWBWB2-issue super-scalar machineSuper-Scalar ExecutionRaul Goycoolea S.Multiprocessor Programming 5116 February 2012
Intrinsic data dependent (aka true dependence) on Instructions:I: add r1,r2,r3J: sub r4,r1,r3If two instructions are data dependent, they cannot executesimultaneously, be completely overlapped or execute in out-of-orderIf data dependence caused a hazard in pipeline,called a Read After Write (RAW) hazardData Dependence and HazardsRaul Goycoolea S.Multiprocessor Programming 5216 February 2012
HW/SW must preserve program order:order instructions would execute in if executed sequentially asdetermined by original source programDependences are a property of programsImportance of the data dependencies1) indicates the possibility of a hazard2) determines order in which results must be calculated3) sets an upper bound on how much parallelism can possiblybe exploitedGoal: exploit parallelism by preserving program order onlywhere it affects the outcome of the programILP and Data Dependencies, HazardsRaul Goycoolea S.Multiprocessor Programming 5316 February 2012
Name Dependence #1: Anti-dependence
Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; there are 2 versions of name dependence.
InstrJ writes an operand before InstrI reads it:
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
Name Dependence #2: Output Dependence
InstrJ writes an operand before InstrI writes it:
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict.
Register renaming resolves name dependences for registers; renaming can be done either by the compiler or by hardware.
Every instruction is control dependent on some set ofbranches, and, in general, these control dependencies mustbe preserved to preserve program orderif p1 {S1;};if p2 {S2;}S1 is control dependent on p1, and S2 is control dependenton p2 but not on p1.Control dependence need not be preservedwilling to execute instructions that should not have beenexecuted, thereby violating the control dependences, if cando so without affecting correctness of the programSpeculative ExecutionControl DependenciesRaul Goycoolea S.Multiprocessor Programming 5616 February 2012
Greater ILP: Overcome control dependence by hardwarespeculating on outcome of branches and executingprogram as if guesses were correctSpeculation ⇒ fetch, issue, and executeinstructions as if branch predictions were alwayscorrectDynamic scheduling ⇒ only fetches and issuesinstructionsEssentially a data flow execution model: Operationsexecute as soon as their operands are availableSpeculationRaul Goycoolea S.Multiprocessor Programming 5716 February 2012
Speculation is Rampant in Modern Superscalars
Different predictors: branch prediction, value prediction, prefetching (memory access pattern prediction)
Inefficient: predictions can go wrong, and wrongly predicted work has to be flushed out; even when it does not impact performance, it consumes power
Implicit Parallelism: Superscalar ProcessorsExplicit ParallelismShared Instruction ProcessorsShared Sequencer ProcessorsShared Network ProcessorsShared Memory ProcessorsMulticore ProcessorsOutlineRaul Goycoolea S.Multiprocessor Programming 5916 February 2012
Parallelism is exposed to softwareCompiler or ProgrammerMany different formsLoosely coupled Multiprocessors to tightly coupled VLIWExplicit Parallel ProcessorsRaul Goycoolea S.Multiprocessor Programming 6016 February 2012
Little’s Law

Parallelism = Throughput * Latency

To maintain a throughput of T operations per cycle when each operation has a latency of L cycles, you need T*L independent operations in flight.
For fixed parallelism: decreased latency allows increased throughput; decreased throughput allows increased latency tolerance.
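A quick worked example (my numbers, for illustration only): to sustain 4 floating-point operations per cycle on a unit with a 5-cycle latency, the hardware or the compiler must keep 4 * 5 = 20 independent operations in flight; if the latency grows to 10 cycles at the same throughput, 40 independent operations are needed.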
TimeTimeTimeTimeData-Level Parallelism (DLP)Instruction-Level Parallelism (ILP)PipeliningThread-Level Parallelism (TLP)Types of Software ParallelismRaul Goycoolea S.Multiprocessor Programming 6216 February 2012
PipeliningThreadParallelDataParallelInstructionParallelTranslating Parallelism TypesRaul Goycoolea S.Multiprocessor Programming 6316 February 2012
What is a sequential program?A single thread of control that executes one instruction and when it isfinished execute the next logical instructionWhat is a concurrent program?A collection of autonomous sequential threads, executing (logically) inparallelThe implementation (i.e. execution) of a collection of threads can be:Multiprogramming– Threads multiplex their executions on a single processor.Multiprocessing– Threads multiplex their executions on a multiprocessor or a multicore systemDistributed Processing– Processes multiplex their executions on several different machinesWhat is concurrency?Raul Goycoolea S.Multiprocessor Programming 6416 February 2012
Concurrency is not (only) parallelismInterleaved ConcurrencyLogically simultaneous processingInterleaved execution on a singleprocessorParallelismPhysically simultaneous processingRequires a multiprocessors or amulticore systemTimeTimeABCABCConcurrency and ParallelismRaul Goycoolea S.Multiprocessor Programming 6516 February 2012
There are a lot of ways to use Concurrency inProgrammingSemaphoresBlocking & non-blocking queuesConcurrent hash mapsCopy-on-write arraysExchangersBarriersFuturesThread pool supportOther Types of SynchronizationRaul Goycoolea S.Multiprocessor Programming 6616 February 2012
DeadlockTwo or more threads stop and wait for each otherLivelockTwo or more threads continue to execute, but makeno progress toward the ultimate goalStarvationSome thread gets deferred foreverLack of fairnessEach thread gets a turn to make progressRace ConditionSome possible interleaving of threads results in anundesired computation resultPotential Concurrency ProblemsRaul Goycoolea S.Multiprocessor Programming 6716 February 2012
Concurrency and Parallelism are important conceptsin Computer ScienceConcurrency can simplify programmingHowever it can be very hard to understand and debugconcurrent programsParallelism is critical for high performanceFrom Supercomputers in national labs toMulticores and GPUs on your desktopConcurrency is the basis for writing parallel programsNext Lecture: How to write a Parallel ProgramParallelism ConclusionsRaul Goycoolea S.Multiprocessor Programming 6816 February 2012
Shared memory––––Ex: Intel Core 2 Duo/QuadOne copy of data sharedamong many coresAtomicity, locking andsynchronizationessential for correctnessMany scalability issuesDistributed memory––––Ex: CellCores primarily access localmemoryExplicit data exchangebetween coresData distribution andcommunication orchestrationis essential for performanceP1 P2 P3 PnMemoryInterconnection NetworkInterconnection NetworkP1 P2 P3 PnM1 M2 M3 MnTwo primary patterns of multicore architecture designArchitecture RecapRaul Goycoolea S.Multiprocessor Programming 6916 February 2012
Programming Shared Memory Processors
Processors 1…n ask for X; there is only one place to look.
Communication is through shared variables; race conditions are possible.
Use synchronization to protect from conflicts; change how data is stored to minimize synchronization.
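As a minimal illustration of protecting a shared variable (my sketch, not from the slides), a pthreads mutex serializes the conflicting updates so no increment is lost to a race:

#include <pthread.h>

long x = 0;                                    /* shared variable */
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds to the shared counter; without the lock two threads
   could read the same old value and one of the updates would be lost. */
void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&x_lock);
        x += 1;
        pthread_mutex_unlock(&x_lock);
    }
    return NULL;
}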
Example of Parallelization
Data parallel: perform the same computation but operate on different data.
A single process can fork multiple concurrent threads; each thread encapsulates its own execution path; each thread has local state and shared resources; threads communicate through shared resources such as global memory.

for (i = 0; i < 12; i++)
  C[i] = A[i] + B[i];

[Diagram: fork (threads), iterations i = 0..11 distributed across the threads, join (barrier).]
Example Parallelization with Threads

#include <pthread.h>
#include <stdint.h>

int A[12] = {...};
int B[12] = {...};
int C[12];

/* Each thread adds one block of 4 elements, starting at the index
   passed in through the thread argument. */
void *add_arrays(void *arg)
{
  int start = (int)(intptr_t)arg;
  int i;
  for (i = start; i < start + 4; i++)
    C[i] = A[i] + B[i];
  return NULL;
}

int main(int argc, char *argv[])
{
  pthread_t thread_ids[3];
  int rc, t;
  for (t = 0; t < 3; t++) {
    rc = pthread_create(&thread_ids[t],
                        NULL /* attributes */,
                        add_arrays /* function */,
                        (void *)(intptr_t)(t * 4) /* start index for this thread */);
  }
  pthread_exit(NULL);  /* main exits; the process lives until the worker threads finish */
}

[Diagram: fork (threads), iterations i = 0..11 split into blocks of four, join (barrier).]
Data parallelismPerform same computationbut operate on different dataControl parallelismPerform different functionsfork (threads)join (barrier)pthread_create(/* thread id */,/* attributes */,/*/*any functionargs to function*/,*/);Types of ParallelismRaul Goycoolea S.Multiprocessor Programming 7316 February 2012
Uniform Memory Access (UMA)Centrally located memoryAll processors are equidistant (access times)Non-Uniform Access (NUMA)Physically partitioned but accessible by allProcessors have the same address spacePlacement of data affects performanceMemory Access Latency in SharedMemory ArchitecturesRaul Goycoolea S.Multiprocessor Programming 7416 February 2012
Coverage or extent of parallelism in algorithmGranularity of data partitioning among processorsLocality of computation and communication… so how do I parallelize my program?Summary of Parallel PerformanceFactorsRaul Goycoolea S.Multiprocessor Programming 7516 February 2012
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 7616 February 2012
ParallelDesignPatterns
Common Steps to Create a Parallel Program
[Diagram: a sequential computation is partitioned into tasks (decomposition), tasks are assigned to processes (assignment), processes are coordinated (orchestration), and processes are mapped onto processors (mapping).]
Decomposition (Amdahl’s Law)
Identify concurrency and decide at what level to exploit it.
Break up the computation into tasks to be divided among processes:
• Tasks may become available dynamically
• The number of tasks may vary with time
Provide enough tasks to keep processors busy; the number of tasks available at a time is an upper bound on the achievable speedup.
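For reference (the slide names Amdahl’s Law without stating it): if a fraction p of the execution time can be parallelized over n processors, the speedup is bounded by

speedup = 1 / ((1 - p) + p/n)

so, for example, with p = 0.9 and n = 16 the speedup is at most 1 / (0.1 + 0.9/16), roughly 6.4, no matter how many tasks are created.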
Specify mechanism to divide work among coreBalance work and reduce communicationStructured approaches usually work wellCode inspection or understanding of applicationWell-known design patternsAs programmers, we worry about partitioning firstIndependent of architecture or programming modelBut complexity often affect decisions!Granularity
Computation and communication concurrencyPreserve locality of dataSchedule tasks to satisfy dependences earlyOrchestration and Mapping
Provides a cookbook to systematically guide programmersDecompose, Assign, Orchestrate, MapCan lead to high quality solutions in some domainsProvide common vocabulary to the programming communityEach pattern has a name, providing a vocabulary fordiscussing solutionsHelps with software reusability, malleability, and modularityWritten in prescribed format to allow the reader toquickly understand the solution and its contextOtherwise, too difficult for programmers, and software will notfully exploit parallel hardwareParallel Programming by Pattern
Berkeley architecture professorChristopher AlexanderIn 1977, patterns for cityplanning, landscaping, andarchitecture in an attempt tocapture principles for “living”designHistory
Example 167 (p. 783)
Design Patterns: Elements of Reusable Object-Oriented Software (1995)Gang of Four (GOF): Gamma, Helm, Johnson, VlissidesCatalogue of patternsCreation, structural, behavioralPatterns in Object-OrientedProgramming
Algorithm ExpressionFinding ConcurrencyExpose concurrent tasksAlgorithm StructureMap tasks to processes toexploit parallel architecture4 Design SpacesSoftware ConstructionSupporting StructuresCode and data structuringpatternsImplementation MechanismsLow level mechanisms usedto write parallel programsPatterns for ParallelProgramming. Mattson,Sanders, and Massingill(2005).Patterns for Parallelizing Programs
splitfrequency encodedmacroblocksZigZagIQuantizationIDCTSaturationspatially encoded macroblocksdifferentially codedmotion vectorsMotion Vector DecodeRepeatmotion vectorsMPEG bit streamVLDmacroblocks, motion vectorsMPEG DecoderjoinMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayHere’s my algorithm, Where’s theconcurrency?
Task decompositionIndependent coarse-grainedcomputationInherent to algorithmSequence of statements(instructions) that operatetogether as a groupCorresponds to some logicalpart of programUsually follows from the wayprogrammer thinks about aproblemjoinmotion vectorsspatially encoded macroblocksIDCTSaturationMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayHere’s my algorithm, Where’s theconcurrency?
joinmotion vectorsSaturationspatially encoded macroblocksMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationIDCTMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatTask decompositionParallelism in the applicationData decompositionSame computation is appliedto small data chunks derivedfrom large data setHere’s my algorithm, Where’s theconcurrency?
motion vectorsspatially encoded macroblocksMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationIDCTSaturationjoinMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatTask decompositionParallelism in the applicationData decompositionSame computation many dataPipeline decompositionData assembly linesProducer-consumer chainsHere’s my algorithm, Where’s theconcurrency?
Algorithms start with a good understanding of theproblem being solvedPrograms often naturally decompose into tasksTwo common decompositions are––Function calls andDistinct loop iterationsEasier to start with many tasks and later fuse them,rather than too few tasks and later try to split themGuidelines for Task Decomposition
FlexibilityProgram design should afford flexibility in the number andsize of tasks generated––Tasks should not tied to a specific architectureFixed tasks vs. Parameterized tasksEfficiencyTasks should have enough work to amortize the cost ofcreating and managing themTasks should be sufficiently independent so that managingdependencies doesn‟t become the bottleneckSimplicityThe code has to remain readable and easy to understand,and debugGuidelines for Task Decomposition
Data decomposition is often implied by taskdecompositionProgrammers need to address task and datadecomposition to create a parallel programWhich decomposition to start with?Data decomposition is a good starting point whenMain computation is organized around manipulation of alarge data structureSimilar operations are applied to different parts of thedata structureGuidelines for Data DecompositionRaul Goycoolea S.Multiprocessor Programming 9316 February 2012
Common Data Decompositions
Array data structures: decomposition of arrays along rows, columns, or blocks.
Recursive data structures: for example, decomposition of trees into sub-trees.
[Diagram: the problem is split into subproblems, each subproblem is computed, and the partial results are merged into the solution.]
Raul Goycoolea S. Multiprocessor Programming, 16 February 2012
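A minimal sketch of a row-block decomposition (my illustration, not from the slides): each worker is handed a contiguous range of rows of a row-major matrix, which is the simplest of the array decompositions named above.

/* Process rows [row_begin, row_end) of an n_cols-wide matrix stored
   row-major in a flat array; each worker gets a disjoint row block. */
void process_row_block(double *matrix, int n_cols,
                       int row_begin, int row_end) {
    for (int r = row_begin; r < row_end; r++)
        for (int c = 0; c < n_cols; c++)
            matrix[r * n_cols + c] *= 2.0;      /* placeholder computation */
}

/* Divide n_rows as evenly as possible among n_workers. */
void decompose_rows(double *matrix, int n_rows, int n_cols, int n_workers) {
    int chunk = (n_rows + n_workers - 1) / n_workers;   /* ceiling division */
    for (int w = 0; w < n_workers; w++) {
        int begin = w * chunk;
        int end   = (begin + chunk < n_rows) ? begin + chunk : n_rows;
        if (begin < end)
            process_row_block(matrix, n_cols, begin, end); /* one block per worker */
    }
}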
FlexibilitySize and number of data chunks should support a widerange of executionsEfficiencyData chunks should generate comparable amounts ofwork (for load balancing)SimplicityComplex data compositions can get difficult to manageand debugRaul Goycoolea S.Multiprocessor Programming 9516 February 2012Guidelines for Data Decompositions
Data is flowing through a sequence of stagesAssembly line is a good analogyWhat’s a prime example of pipeline decomposition incomputer architecture?Instruction pipeline in modern CPUsWhat’s an example pipeline you may use in your UNIX shell?Pipes in UNIX: cat foobar.c | grep bar | wcOther examplesSignal processingGraphicsZigZagIQuantizationIDCTSaturationGuidelines for Pipeline DecompositionRaul Goycoolea S.Multiprocessor Programming 9616 February 2012
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 9716 February 2012
Performance &Optimization
Coverage or extent of parallelism in algorithmAmdahl‟s LawGranularity of partitioning among processorsCommunication cost and load balancingLocality of computation and communicationCommunication between processors or betweenprocessors and their memoriesReview: Keys to Parallel Performance
Communication Cost Model

Cost = f * ( o + l + (n/m)/B + t_c - overlap )

where:
f = frequency of messages
o = overhead per message (at both ends)
l = network delay per message
n = total data sent
m = number of messages
B = bandwidth along the path (determined by the network)
t_c = cost induced by contention per message
overlap = amount of latency hidden by concurrency with computation
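A quick worked example with made-up numbers: sending n = 4 MB in m = 100 messages over a B = 1 GB/s link means each message carries about 40 KB, which takes about 40 us on the wire; adding o = 5 us overhead and l = 10 us delay gives roughly 55 us per message before overlap, and with 20 us of latency per message hidden by computation the total cost is about 100 * (55 - 20) = 3,500 us. The model makes the two levers explicit: aggregate messages to reduce m, and overlap communication with computation.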
synchronizationpointGet DataComputeGet DataCPU is idleMemory is idleComputeOverlapping Communication withComputation
Computation to communication ratio limitsperformance gains from pipeliningGet DataComputeGet DataComputeWhere else to look for performance?Limits in Pipelining Communication
Determined by program implementation andinteractions with the architectureExamples:Poor distribution of data across distributed memoriesUnnecessarily fetching data that is not usedRedundant data fetchesArtifactual Communication
In uniprocessors, CPU communicates with memoryLoads and stores are to uniprocessors as_______ and ______ are to distributed memorymultiprocessorsHow is communication overlap enhanced inuniprocessors?Spatial localityTemporal locality“get” “put”Lessons From Uniprocessors
CPU asks for data at address 1000Memory sends data at address 1000 … 1064Amount of data sent depends on architectureparameters such as the cache block sizeWorks well if CPU actually ends up using data from1001, 1002, …, 1064Otherwise wasted bandwidth and cache capacitySpatial Locality
Main memory access is expensiveMemory hierarchy adds small but fast memories(caches) near the CPUMemories get bigger as distancefrom CPU increasesCPU asks for data at address 1000Memory hierarchy anticipates more accesses to sameaddress and stores a local copyWorks well if CPU actually ends up using data from 1000 overand over and over …Otherwise wasted cache capacitymainmemorycache(level 2)cache(level 1)Temporal Locality
Data is transferred in chunks to amortizecommunication costCell: DMA gets up to 16KUsually get a contiguous chunk of memorySpatial localityComputation should exhibit good spatial localitycharacteristicsTemporal localityReorder computation to maximize use of data fetchedReducing Artifactual Costs inDistributed Memory Architectures
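As an illustration of why access order matters (my example, not from the slides): traversing a row-major C array down its columns touches a new cache line on nearly every access, while row order reuses each fetched line.

#define N 1024
double grid[N][N];

/* Row-order traversal: consecutive accesses fall in the same cache line
   (good spatial locality). */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += grid[i][j];
    return s;
}

/* Column-order traversal: each access jumps N * sizeof(double) bytes,
   so most accesses miss (poor spatial locality). */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += grid[i][j];
    return s;
}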
Tasks mapped to execution units (threads)Threads run on individual processors (cores)finish line: sequential time + longest parallel timeTwo keys to faster executionLoad balance the work among the processorsMake execution on each processor fastersequentialparallelsequentialparallelSingle Thread Performance
Understanding Performance
Need some way of measuring performance. Coarse-grained measurements:

% gcc sample.c
% time a.out
2.312u 0.062s 0:02.50 94.8%
% gcc sample.c -O3
% time a.out
1.921u 0.093s 0:02.03 99.0%

… but did we learn much about what’s going on?

#include <stdio.h>
#include <string.h>
#define N (1 << 23)
#define T (10)
double a[N], b[N];

void cleara(double a[N]) {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = 0;
  }
}

int main() {
  double s = 0, s2 = 0;
  int i, j;
  /* record start time */
  for (j = 0; j < T; j++) {
    for (i = 0; i < N; i++) {
      b[i] = 0;
    }
    cleara(a);
    memset(a, 0, sizeof(a));
    for (i = 0; i < N; i++) {
      s  += a[i] * b[i];
      s2 += a[i] * a[i] + b[i] * b[i];
    }
  }
  /* record stop time */
  printf("s %f s2 %f\n", s, s2);
}
Increasingly possible to get accurate measurementsusing performance countersSpecial registers in the hardware to measure eventsInsert code to start, read, and stop counterMeasure exactly what you want, anywhere you wantCan measure communication and computation durationBut requires manual changesMonitoring nested scopes is an issueHeisenberg effect: counters can perturb execution timetimestopclear/startMeasurements Using Counters
Event-based profilingInterrupt execution when an event counter reaches athresholdTime-based profilingInterrupt execution every t secondsWorks without modifying your codeDoes not require that you know where problem might beSupports multiple languages and programming modelsQuite efficient for appropriate sampling frequenciesDynamic Profiling
Cycles (clock ticks)Pipeline stallsCache hitsCache missesNumber of instructionsNumber of loadsNumber of storesNumber of floating point operations…Counter Examples
Processor utilizationCycles / Wall Clock TimeInstructions per cycleInstructions / CyclesInstructions per memory operationInstructions / Loads + StoresAverage number of instructions per load missInstructions / L1 Load MissesMemory trafficLoads + Stores * Lk Cache Line SizeBandwidth consumedLoads + Stores * Lk Cache Line Size / Wall Clock TimeMany othersCache miss rateBranch misprediction rate…Useful Derived Measurements
applicationsourcerun(profilesexecution)performanceprofilebinaryobject codecompilerbinary analysisinterpret profilesourcecorrelationCommon Profiling Workflow
GNU gprofWidely available with UNIX/Linux distributionsgcc –O2 –pg foo.c –o foo./foogprof fooHPC Toolkithttp://www.hipersoft.rice.edu/hpctoolkit/PAPIhttp://icl.cs.utk.edu/papi/VTunehttp://www.intel.com/cd/software/products/asmo-na/eng/vtune/Many othersPopular Runtime Profiling Tools
Performance in Uniprocessors

time = compute + wait

Instruction-level parallelism: multiple functional units, deeply pipelined, speculation, ...
Data-level parallelism: SIMD (Single Instruction, Multiple Data), short vector instructions (multimedia extensions)
• Hardware is simpler, no heavily ported register files
• Instructions are more compact
• Reduces instruction fetch bandwidth
Complex memory hierarchies: multiple levels of caches, many outstanding misses, prefetching, ...
Single Instruction, Multiple Data
SIMD registers hold short vectors; an instruction operates on all elements in a SIMD register at once.

Scalar code (one scalar register per operand):
for (int i = 0; i < n; i += 1) {
  c[i] = a[i] + b[i];
}

Vector code (one SIMD register per operand):
for (int i = 0; i < n; i += 4) {
  c[i:i+3] = a[i:i+3] + b[i:i+3];
}
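As an illustration (not from the original slides), here is a minimal sketch of the same loop written with x86 SSE intrinsics, assuming n is a multiple of 4 and the pointers refer to valid float buffers:

#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* c[i] = a[i] + b[i], four floats per iteration.
   Assumes n is a multiple of 4; a scalar remainder loop would be
   needed otherwise. */
void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 unaligned floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
        _mm_storeu_ps(&c[i], vc);
    }
}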
SIMD in Major Instruction Set Architectures (ISAs)
For example, the Cell SPU has 128 128-bit registers; all instructions are SIMD instructions; registers are treated as short vectors of 8/16/32-bit integers or single/double-precision floats.

Instruction Set   Architecture   SIMD Width   Floating Point
AltiVec           PowerPC        128          yes
MMX/SSE           Intel          64/128       yes
3DNow!            AMD            64           yes
VIS               Sun            64           no
MAX2              HP             64           no
MVI               Alpha          64           no
MDMX              MIPS V         64           yes
Library calls and inline assemblyDifficult to programNot portableDifferent extensions to the same ISAMMX and SSESSE vs. 3DNow!Compiler vs. Crypto Oracle T4Using SIMD Instructions
Tune the parallelism firstThen tune performance on individual processorsModern processors are complexNeed instruction level parallelism for performanceUnderstanding performance requires a lot of probingOptimize for the memory hierarchyMemory is much slower than processorsMulti-layer memory hierarchies try to hide the speed gapData locality is essential for performanceProgramming for Performance
May have to change everything!Algorithms, data structures, program structureFocus on the biggest performance impedimentsToo many issues to study everythingRemember the law of diminishing returnsProgramming for Performance
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 12216 February 2012
ParallelCompilers
Parallel ExecutionParallelizing CompilersDependence AnalysisIncreasing Parallelization OpportunitiesGeneration of Parallel LoopsCommunication Code GenerationCompilers OutlineRaul Goycoolea S.Multiprocessor Programming 12416 February 2012
Instruction Level Parallelism(ILP)Task Level Parallelism (TLP)Loop Level Parallelism (LLP)or Data ParallelismPipeline ParallelismDivide and ConquerParallelismScheduling and HardwareMainly by handHand or Compiler GeneratedHardware or StreamingRecursive functionsTypes of ParallelismRaul Goycoolea S.Multiprocessor Programming 12516 February 2012
90% of the execution time in 10% of the codeMostly in loopsIf parallel, can get good performanceLoad balancingRelatively easy to analyzeWhy Loops?Raul Goycoolea S.Multiprocessor Programming 12616 February 2012
Programmer-Defined Parallel Loops
FORALL: no “loop-carried dependences”; fully parallel
FORACROSS: some “loop-carried dependences”
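As an illustration (not part of the original deck), a FORALL-style loop maps directly onto an OpenMP worksharing construct; this is a minimal sketch assuming an OpenMP-capable C compiler:

/* Every iteration writes a distinct A[i] and reads nothing written by
   another iteration, so there are no loop-carried dependences: the loop
   is a FORALL and its iterations can be split across threads. */
void forall_example(double *A, const double *B, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        A[i] = B[i] + 1.0;
    }
}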
Parallel ExecutionParallelizing CompilersDependence AnalysisIncreasing Parallelization OpportunitiesGeneration of Parallel LoopsCommunication Code GenerationOutlineRaul Goycoolea S.Multiprocessor Programming 12816 February 2012
Parallelizing Compilers
Finding FORALL loops out of FOR loops. Examples:

FOR I = 0 to 5
  A[I+1] = A[I] + 1

FOR I = 0 to 5
  A[I] = A[I+6] + 1

FOR I = 0 to 5
  A[2*I] = A[2*I + 1] + 1
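A brief worked analysis of the three loops above (my annotation, not from the slides):

/* FOR I = 0 to 5: A[I+1] = A[I] + 1
   Iteration I writes A[I+1], which iteration I+1 reads as A[I]:
   a loop-carried true dependence, so this is NOT a FORALL.

   FOR I = 0 to 5: A[I] = A[I+6] + 1
   Writes touch A[0..5], reads touch A[6..11]; the index sets never
   overlap within the iteration range, so there is no loop-carried
   dependence and the loop is a FORALL.

   FOR I = 0 to 5: A[2*I] = A[2*I+1] + 1
   Writes touch even indices, reads touch odd indices; again no
   overlap, so the loop is a FORALL. */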
Dependences
True dependence:    a = ...   then   ... = a
Anti dependence:    ... = a   then   a = ...
Output dependence:  a = ...   then   a = ...
Definition: a data dependence exists between dynamic instances i and j iff:
• either i or j is a write operation
• i and j refer to the same variable
• i executes before j
How about array accesses within loops?
Parallel ExecutionParallelizing CompilersDependence AnalysisIncreasing Parallelization OpportunitiesGeneration of Parallel LoopsCommunication Code GenerationOutlineRaul Goycoolea S.Multiprocessor Programming 13116 February 2012
Array Access in a Loop

FOR I = 0 to 5
  A[I] = A[I] + 1

[Diagram: iteration space 0..5 mapped onto data space A[0]..A[12]; each iteration I reads and writes only A[I].]
Find data dependences in loopFor every pair of array acceses to the same arrayIf the first access has at least one dynamic instance (an iteration) inwhich it refers to a location in the array that the second access alsorefers to in at least one of the later dynamic instances (iterations).Then there is a data dependence between the statements(Note that same array can refer to itself – output dependences)DefinitionLoop-carried dependence:dependence that crosses a loop boundaryIf there are no loop carried dependences are parallelizableRecognizing FORALL LoopsRaul Goycoolea S.Multiprocessor Programming 13316 February 2012
What is the Dependence?

FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1

FOR I = 1 to n
  FOR J = 1 to n
    A[I] = A[I-1] + 1

[Diagrams: the dependences plotted in the (I, J) iteration space.]
Parallel ExecutionParallelizing CompilersDependence AnalysisIncreasing Parallelization OpportunitiesGeneration of Parallel LoopsCommunication Code GenerationOutlineRaul Goycoolea S.Multiprocessor Programming 13516 February 2012
Scalar PrivatizationReduction RecognitionInduction Variable IdentificationArray PrivatizationInterprocedural ParallelizationLoop TransformationsGranularity of ParallelismIncreasing ParallelizationOpportunitiesRaul Goycoolea S.Multiprocessor Programming 13616 February 2012
Scalar Privatization
Example:
FOR i = 1 to n
  X = A[i] * 3;
  B[i] = X;
Is there a loop-carried dependence? What is the type of dependence?
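Every iteration writes X before reading it, so the only loop-carried dependences on X are anti and output (name) dependences; giving each thread its own copy removes them. A minimal sketch of the privatized loop in OpenMP (my illustration, not from the slides):

void privatize_example(const double *A, double *B, int n) {
    double X;
    /* private(X) gives each thread its own X, eliminating the
       loop-carried anti/output dependences on the shared scalar. */
    #pragma omp parallel for private(X)
    for (int i = 0; i < n; i++) {
        X = A[i] * 3.0;
        B[i] = X;
    }
}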
Reduction Recognition
Reduction analysis: only associative operations; the result is never used within the loop.
Transformation:

integer Xtmp[NUMPROC];
Barrier();
FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)
  Xtmp[myPid] = Xtmp[myPid] + A[i];
Barrier();
if (myPid == 0) {
  FOR p = 0 to NUMPROC-1
    X = X + Xtmp[p];
  ...
}
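The same transformation is what an OpenMP reduction clause performs automatically: each thread accumulates into a private copy and the partial sums are combined when the loop finishes. A minimal sketch (my illustration, not from the slides):

double sum_reduction(const double *A, int n) {
    double X = 0.0;
    /* Each thread keeps a private partial sum; the runtime adds the
       partial sums into X at the end of the parallel loop. */
    #pragma omp parallel for reduction(+:X)
    for (int i = 0; i < n; i++) {
        X += A[i];
    }
    return X;
}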
Induction Variables
Example:
FOR i = 0 to N
  A[i] = 2^i;

After strength reduction:
t = 1
FOR i = 0 to N
  A[i] = t;
  t = t*2;

What happened to the loop-carried dependences? Need to do the opposite of this!
Perform induction variable analysis; rewrite IVs as a function of the loop variable.
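A minimal sketch of the rewrite (my illustration, assuming the powers of two fit in the element type): expressing the induction variable t as a pure function of i removes the loop-carried dependence and makes the loop a FORALL again.

void rewrite_induction(long *A, int N) {
    /* Strength-reduced form: t depends on the previous iteration.
       Rewritten form: each iteration computes its own value from i,
       so iterations are independent and can run in parallel. */
    #pragma omp parallel for
    for (int i = 0; i <= N; i++) {
        A[i] = 1L << i;          /* t = 2^i expressed as a function of i */
    }
}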
Similar to scalar privatizationHowever, analysis is more complexArray Data Dependence Analysis:Checks if two iterations access the same locationArray Data Flow Analysis:Checks if two iterations access the same valueTransformationsSimilar to scalar privatizationPrivate copy for each processor or expand with an additionaldimensionArray PrivatizationRaul Goycoolea S.Multiprocessor Programming 14016 February 2012
Function calls will make a loop unparallelizatbleReduction of available parallelismA lot of inner-loop parallelismSolutionsInterprocedural AnalysisInliningInterprocedural ParallelizationRaul Goycoolea S.Multiprocessor Programming 14116 February 2012
Cache Coherent Shared Memory MachineGenerate code for the parallel loop nestNo Cache Coherent Shared Memoryor Distributed Memory MachinesGenerate code for the parallel loop nestIdentify communicationGenerate communication codeCommunication Code GenerationRaul Goycoolea S.Multiprocessor Programming 14216 February 2012
Eliminating redundant communicationCommunication aggregationMulti-cast identificationLocal memory managementCommunication OptimizationsRaul Goycoolea S.Multiprocessor Programming 14316 February 2012
Automatic parallelization of loops with arraysRequires Data Dependence AnalysisIteration space & data space abstractionAn integer programming problemMany optimizations that’ll increase parallelismTransforming loop nests and communication code generationFourier-Motzkin Elimination provides a nice frameworkSummaryRaul Goycoolea S.Multiprocessor Programming 14416 February 2012
<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 14516 February 2012
Future ofParallelArchitectures
"I think there is a world market formaybe five computers.“– Thomas Watson, chairman of IBM, 1949"There is no reason in the worldanyone would want a computer in theirhome. No reason.”– Ken Olsen, Chairman, DEC, 1977"640K of RAM ought to be enough foranybody.”– Bill Gates, 1981Predicting the Future is Always RiskyRaul Goycoolea S.Multiprocessor Programming 14716 February 2012
EvolutionRelatively easy to predictExtrapolate the trendsRevolutionA completely new technology or solutionHard to PredictParadigm Shifts can occur in bothFuture = Evolution + RevolutionRaul Goycoolea S.Multiprocessor Programming 14816 February 2012
EvolutionTrendsArchitectureLanguages, Compilers and ToolsRevolutionCrossing the Abstraction BoundariesOutlineRaul Goycoolea S.Multiprocessor Programming 14916 February 2012
Look at the trendsMoore‟s LawPower ConsumptionWire DelayHardware ComplexityParallelizing CompilersProgram Design MethodologiesDesign Drivers are different inDifferent GenerationsEvolutionRaul Goycoolea S.Multiprocessor Programming 15016 February 2012
The Road to Multicore: Moore’s Law [chart repeated from earlier; Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
The Road to Multicore: Uniprocessor Performance (SPECint) [chart repeated from earlier]
The Road to Multicore: Uniprocessor Performance (SPECint). General-purpose unicores have stopped historic performance scaling: power consumption, wire delays, DRAM access latency, diminishing returns of more instruction-level parallelism
Power Consumption (watts) [chart repeated from earlier]
Power Efficiency (watts/spec) [chart repeated from earlier]
Range of a Wire in One Clock Cycle [chart repeated from earlier; SIA Roadmap]
DRAM Access Latency [chart repeated from earlier]
CPU Architecture: heat becoming an unmanageable problem [chart repeated from earlier; Intel Developer Forum, Spring 2004, Pat Gelsinger]
Improvement in Automatic Parallelization
[Timeline, 1970 to 2010: automatic parallelizing compilers for FORTRAN; vectorization technology; compiling for instruction-level parallelism; prevalence of type-unsafe languages and complex data structures (C, C++); type-safe languages (Java, C#); demand driven by multicores?]
Multicores Future [chart repeated from earlier: number of cores per chip versus year, 1970 to 2010, from unicores to multicores with 2 to 512 cores]
EvolutionTrendsArchitectureLanguages, Compilers and ToolsRevolutionCrossing the Abstraction BoundariesOutlineRaul Goycoolea S.Multiprocessor Programming 16116 February 2012
Novel Opportunities in Multicores
• Don’t have to contend with uniprocessors
• The era of Moore’s Law induced performance gains is over!
• Parallel programming will be required by the masses, not just a few supercomputer super-users
Not your same old multiprocessor problem:
• How does going from multiprocessors to multicores impact programs?
• What changed? Where is the impact?
• Communication bandwidth
• Communication latency
Communication Bandwidth
• How much data can be communicated between two cores?
• What changed?
– Number of wires: IO is the true bottleneck; on-chip wire density is very high
– Clock rate: IO is slower than on-chip
– Multiplexing: no sharing of pins on chip
• Roughly 32 Gigabits/sec between chips versus ~300 Terabits/sec between cores on a chip: about 10,000X
• Impact on programming model?
– Massive data exchange is possible
– Data movement is not the bottleneck, so processor affinity is not that important
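To make the bandwidth point concrete, here is a minimal sketch, assuming POSIX threads and a pthread barrier: a producer fills a large shared array and a consumer reads it in place, so only a pointer and one synchronization cross between cores, never the data itself. The buffer size and the barrier-based handoff are illustrative choices, not something prescribed by the slides.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)                 /* ~16M doubles (~128 MB): a "massive" buffer */

static double *shared_buf;          /* single copy of the data, visible to every core */
static pthread_barrier_t handoff;   /* synchronization point between the two threads */

static void *producer(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++)
        shared_buf[i] = (double)i;          /* fill the buffer in place */
    pthread_barrier_wait(&handoff);         /* hand it over: a pointer changes hands, not the data */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    pthread_barrier_wait(&handoff);         /* wait until the producer is done */
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += shared_buf[i];               /* read the same memory: no copy was ever made */
    printf("checksum = %f\n", sum);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    shared_buf = malloc(N * sizeof(double));
    if (shared_buf == NULL) return 1;
    pthread_barrier_init(&handoff, NULL, 2);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    pthread_barrier_destroy(&handoff);
    free(shared_buf);
    return 0;
}

On a distributed-memory design the same handoff would mean shipping the whole buffer across the interconnect; here the data never leaves shared memory and only the barrier crosses between cores.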
Communication Latency
• How long does a round-trip communication take?
• What changed?
– Length of wire: very short wires are faster
– Pipeline stages: no multiplexing, on-chip is much closer; bypass and speculation?
• Roughly ~200 cycles between chips down to ~4 cycles between cores: about 50X
• Impact on programming model?
– Ultra-fast synchronization
– Can run real-time apps on multiple cores
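To see why a short round trip changes what is practical, here is a minimal sketch, assuming C11 atomics and POSIX threads: two threads bounce a token through a single atomic flag and the program reports the average round-trip time. The round count and the busy-wait style are illustrative; this kind of fine-grained handoff is only reasonable when core-to-core latency is very low.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static atomic_int turn = 0;     /* 0: ping may run, 1: pong may run */

static void *ping(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&turn, memory_order_acquire) != 0)
            ;                                       /* spin until it is our turn */
        atomic_store_explicit(&turn, 1, memory_order_release);
    }
    return NULL;
}

static void *pong(void *arg) {
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&turn, memory_order_acquire) != 1)
            ;
        atomic_store_explicit(&turn, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, ping, NULL);
    pthread_create(&b, NULL, pong, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* one ping handoff plus one pong handoff = one round trip; ROUNDS of them in total */
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average round trip: %.1f ns\n", ns / ROUNDS);
    return 0;
}

On a multicore the measured figure is typically on the order of tens of nanoseconds; crossing chips in a traditional multiprocessor costs far more, which is why such tight synchronization used to be avoided.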
Past, Present and the Future?
• Traditional multiprocessor: each processing element (PE) with its private caches sits next to its own memory, and chips communicate over an external interconnect
• Basic multicore (e.g., IBM Power): a few PEs with their caches on one chip, sharing memory
• Integrated multicore (e.g., the 8-core, 8-thread Oracle T4): PEs, caches and the on-chip crossbar integrated on a single die in front of shared memory
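Whichever of these designs a program lands on, it usually sees a pool of hardware threads behind a shared-memory interface. A minimal sketch, assuming a POSIX system where sysconf(_SC_NPROCESSORS_ONLN) reports the online hardware contexts (which would be 64 on an 8-core, 8-thread-per-core part), is to size the worker pool at run time:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *work(void *arg) {
    long id = (long)arg;
    printf("worker %ld running\n", id);    /* real work would go here */
    return NULL;
}

int main(void) {
    /* discover how many hardware threads the OS currently exposes */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1) n = 1;

    pthread_t *tid = malloc((size_t)n * sizeof(pthread_t));
    if (tid == NULL) return 1;

    for (long i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, work, (void *)i);
    for (long i = 0; i < n; i++)
        pthread_join(tid[i], NULL);

    printf("ran %ld workers, one per hardware thread\n", n);
    free(tid);
    return 0;
}

Because the core and thread counts are discovered at run time, the same binary scales from a basic dual-core to an integrated 64-thread chip without source changes.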
Summary
• As technology evolves, the inherent flexibility of multiprocessors lets them adapt to new requirements
• Processors can be applied at any time to many kinds of applications
• Optimization adapts processors to high-performance requirements
References
• Author: Raul Goycoolea, Oracle Corporation.
• A search on the WWW for "parallel programming" or "parallel computing" will yield a wide variety of information.
• Recommended reading:
– "Designing and Building Parallel Programs". Ian Foster. http://www-unix.mcs.anl.gov/dbpp/
– "Introduction to Parallel Computing". Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/
– "Overview of Recent Supercomputers". A.J. van der Steen, Jack Dongarra. www.phys.uu.nl/~steen/web03/overview.html
• MIT Multicore Programming Class: 6.189, Prof. Saman Amarasinghe
• Photos/graphics have been created by the author, obtained from non-copyrighted, government or public domain sources (such as http://commons.wikimedia.org/), or used with the permission of authors from other presentations and web pages.
Keep in Touch — Raul Goycoolea Seoane
• Twitter: http://twitter.com/raul_goycoolea
• Facebook: http://www.facebook.com/raul.goycoolea
• Linkedin: http://www.linkedin.com/in/raulgoy
• Blog: http://blogs.oracle.com/raulgoy/
Questions?
Multiprocessor architecture and programming
Multiprocessor architecture and programming

  • 1.Parallel Computing Architecture &Programming TechniquesRaul Goycoolea S.Solution Architect ManagerOracle Enterprise Architecture Group
  • 2.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 216 February 2012
  • 4.The “Software Crisis”“To put it quite bluntly: as long as there were nomachines, programming was no problem at all; whenwe had a few weak computers, programming became amild problem, and now we have gigantic computers,programming has become an equally gigantic problem."-- E. Dijkstra, 1972 Turing Award LectureRaul Goycoolea S.Multiprocessor Programming 416 February 2012
  • 5.The First Software Crisis• Time Frame: ’60s and ’70s• Problem: Assembly Language ProgrammingComputers could handle larger more complex programs• Needed to get Abstraction and Portability withoutlosing PerformanceRaul Goycoolea S.Multiprocessor Programming 516 February 2012
  • 6.Common PropertiesSingle flow of controlSingle memory imageDifferences:Register FileISAFunctional UnitsHow Did We Solve The First SoftwareCrisis?• High-level languages for von-Neumann machinesFORTRAN and C• Provided “common machine language” foruniprocessorsRaul Goycoolea S.Multiprocessor Programming 616 February 2012
  • 7.The Second Software Crisis• Time Frame: ’80s and ’90s• Problem: Inability to build and maintain complex androbust applications requiring multi-million lines ofcode developed by hundreds of programmersComputers could handle larger more complex programs• Needed to get Composability, Malleability andMaintainabilityHigh-performance was not an issue left for Moore’s LawRaul Goycoolea S.Multiprocessor Programming 716 February 2012
  • 8.How Did We Solve the SecondSoftware Crisis?• Object Oriented ProgrammingC++, C# and Java• Also…Better tools• Component libraries, PurifyBetter software engineering methodology• Design patterns, specification, testing, codereviewsRaul Goycoolea S.Multiprocessor Programming 816 February 2012
  • 9.Today:Programmers are Oblivious to Processors• Solid boundary between Hardware and Software• Programmers don’t have to know anything about theprocessorHigh level languages abstract away the processorsEx: Java bytecode is machine independentMoore’s law does not require the programmers to know anythingabout the processors to get good speedups• Programs are oblivious of the processor works on allprocessorsA program written in ’70 using C still works and is much faster today• This abstraction provides a lot of freedom for theprogrammersRaul Goycoolea S.Multiprocessor Programming 916 February 2012
  • 10.The Origins of a Third Crisis• Time Frame: 2005 to 20??• Problem: Sequential performance is left behind byMoore’s law• Needed continuous and reasonable performanceimprovementsto support new featuresto support larger datasets• While sustaining portability, malleability andmaintainability without unduly increasing complexityfaced by the programmer critical to keep-up with thecurrent rate of evolution in softwareRaul Goycoolea S.Multiprocessor Programming 1016 February 2012
  • 11.Performance(vs.VAX-11/780)NumberofTransistors52%/year1001000100001000001978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016%/year108086128625%/year386486PentiumP2P3P4Itanium 2Itanium1,000,000,000100,00010,0001,000,00010,000,000100,000,000From Hennessy and Patterson, Computer Architecture:A Quantitative Approach, 4th edition, 2006The Road to Multicore: Moore’s LawRaul Goycoolea S.Multiprocessor Programming 1116 February 2012
  • 12.Specint200010000.001000.00100.0010.001.0085 86 87 88 89 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07intel pentiumintel pentium2intel pentium3intel pentium4intel itaniumAlpha 21064Alpha 21164Alpha 21264Spar cSuper Spar cSpar c64MipsHP PAPower PCAMD K6AMD K7AMD x86-64The Road to Multicore:Uniprocessor Performance (SPECint)Raul Goycoolea S.Multiprocessor Programming 1216 February 2012Intel 386Intel 486
  • 13.The Road to Multicore:Uniprocessor Performance (SPECint)General-purpose unicores have stopped historicperformance scalingPower consumptionWire delaysDRAM access latencyDiminishing returns of more instruction-level parallelismRaul Goycoolea S.Multiprocessor Programming 1316 February 2012
  • 14.Power100010010185 87 89 91 93 95 97 99 01 03 05 07Intel 386Intel 486intel pentiumintel pentium2intel pentium3intel pentium4intel itaniumAlpha21064Alpha21164Alpha21264SparcSuperSparcSparc64MipsHPPAPower PCAMDK6AMDK7AMDx86-64Power Consumption (watts)Raul Goycoolea S.Multiprocessor Programming 1416 February 2012
  • 15.Watts/Spec0.70.60.50.40.30.20.11982 1984 1987 1990 1993 1995 1998 2001 2004 2006Yearintel 386intel 486intel pentiumintel pentium 2intel pentium 3intel pentium 4intel itaniumAlpha 21064Alpha 21164Alpha 21264SparcSuperSparcSparc64MipsHP PAPower PCAMD K6AMD K7AMD x86-640Power Efficiency (watts/spec)Raul Goycoolea S.Multiprocessor Programming 1516 February 2012
  • 16.Process(microns)0.060.040.0200.260.240.220.20.180.160.140.120.10.081996 1998 2000 2002 2008 2010 2012 20142004 2006Year700 MHz1.25 GHz2.1 GHz6 GHz10 GHz13.5 GHz• 400 mm2 Die• From the SIA RoadmapRange of a Wire in One Clock CycleRaul Goycoolea S.Multiprocessor Programming 1616 February 2012
  • 17.Performance19841994199219821988198619801996199820002002199020041000000100001001YearµProc60%/yr.(2X/1.5yr)DRAM9%/yr.(2X/10 yrs)DRAM Access Latency• Access times are aspeed of light issue• Memory technology isalso changingSRAM are getting harder toscaleDRAM is no longer cheapestcost/bit• Power efficiency is anissue here as wellRaul Goycoolea S.Multiprocessor Programming 1716 February 2012
  • 18.PowerDensity(W/cm2)10,0001,000„70 „80 „90 „00 „1010 400480088080180868085286 386486Pentium®Hot PlateNuclear Reactor100Sun‟s SurfaceRocket NozzleIntel Developer Forum, Spring 2004 - Pat Gelsinger(Pentium at 90 W)Cube relationship between the cycle time and powerCPUs ArchitectureHeat becoming an unmanageable problemRaul Goycoolea S.Multiprocessor Programming 1816 February 2012
  • 19.Diminishing Returns• The ’80s: Superscalar expansion50% per year improvement in performanceTransistors applied to implicit parallelism- pipeline processor (10 CPI --> 1 CPI)• The ’90s: The Era of Diminishing ReturnsSqueaking out the last implicit parallelism2-way to 6-way issue, out-of-order issue, branch prediction1 CPI --> 0.5 CPIPerformance below expectations projects delayed & canceled• The ’00s: The Beginning of the Multicore EraThe need for Explicit ParallelismRaul Goycoolea S.Multiprocessor Programming 1916 February 2012
  • 20.Mit Raw16 Cores2002 Intel TanglewoodDual Core IA/64Intel DempseyDual Core XeonIntel Montecito1.7 Billion transistorsDual Core IA/64Intel Pentium D(Smithfield)CancelledIntel Tejas & JayhawkUnicore (4GHz P4)IBM Power 6Dual CoreIBM Power 4 and 5Dual Cores Since 2001Intel Pentium Extreme3.2GHz Dual CoreIntel YonahDual Core MobileAMD OpteronDual CoreSun Olympus and Niagara8 Processor CoresIBM CellScalable Multicore… 1H 2005 1H 2006 2H 20062H 20052H 2004Unicores are on extinctionNow all is multicore
  • 21.# of1985 199019801970 1975 1995 2000 2005RawCaviumOcteonRazaXLRCSR-1IntelTflopsPicochipPC102CiscoNiagaraBoardcom 1480Xbox3602010218432cores 1612864512256CellOpteron 4PXeon MPAmbricAM20454004800880868080 286 386 486 PentiumPA-8800 Opteron TanglewoodPower4PExtreme Power6YonahP2 P3 ItaniumP4Athlon Itanium 2Multicores FutureRaul Goycoolea S.Multiprocessor Programming 2116 February 2012
  • 22.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 2216 February 2012
  • 24.Traditionally, software has been written for serial computation:• To be run on a single computer having a single Central Processing Unit (CPU)• A problem is broken into a discrete series of instructions• Instructions are executed one after another• Only one instruction may execute at any moment in timeWhat is Parallel Computing?Raul Goycoolea S.Multiprocessor Programming 2416 February 2012
  • 25.What is Parallel Computing?In the simplest sense, parallel computing is the simultaneous use of multiplecompute resources to solve a computational problem:• To be run using multiple CPUs• A problem is broken into discrete parts that can be solved concurrently• Each part is further broken down to a series of instructions• Instructions from each part execute simultaneously on different CPUsRaul Goycoolea S.Multiprocessor Programming 2516 February 2012
  • 26.Options in Parallel Computing?The compute resources might be:• A single computer with multiple processors;• An arbitrary number of computers connected by a network;• A combination of both.The computational problem should be able to:• Be broken apart into discrete pieces of work that can be solvedsimultaneously;• Execute multiple program instructions at any moment in time;• Be solved in less time with multiple compute resources than with asingle compute resource.Raul Goycoolea S.Multiprocessor Programming 2616 February 2012
  • 27.27
  • 28.The Real World is Massively Parallel• Parallel computing is an evolution of serial computing thatattempts to emulate what has always been the state ofaffairs in the natural world: many complex, interrelatedevents happening at the same time, yet within a sequence.For example:• Galaxy formation• Planetary movement• Weather and ocean patterns• Tectonic plate drift Rush hour traffic• Automobile assembly line• Building a jet• Ordering a hamburger at the drive through.Raul Goycoolea S.Multiprocessor Programming 2816 February 2012
  • 29.Architecture ConceptsVon Neumann Architecture• Named after the Hungarian mathematician John von Neumann who first authoredthe general requirements for an electronic computer in his 1945 papers• Since then, virtually all computers have followed this basic design, differing fromearlier computers which were programmed through "hard wiring”• Comprised of four main components:• Memory• Control Unit• Arithmetic Logic Unit• Input/Output• Read/write, random access memory is used to storeboth program instructions and data• Program instructions are coded data which tellthe computer to do something• Data is simply information to be used by theprogram• Control unit fetches instructions/data from memory, decodesthe instructions and then sequentially coordinates operationsto accomplish the programmed task.• Aritmetic Unit performs basic arithmetic operations• Input/Output is the interface to the human operatorRaul Goycoolea S.Multiprocessor Programming 2916 February 2012
  • 30.Flynn’s Taxonomy• There are different ways to classify parallel computers. One of the morewidely used classifications, in use since 1966, is called Flynn'sTaxonomy.• Flynn's taxonomy distinguishes multi-processor computer architecturesaccording to how they can be classified along the two independentdimensions of Instruction and Data. Each of these dimensions canhave only one of two possible states: Single or Multiple.• The matrix below defines the 4 possible classifications according toFlynn:Raul Goycoolea S.Multiprocessor Programming 3016 February 2012
  • 31.Single Instruction, Single Data (SISD):• A serial (non-parallel) computer• Single Instruction: Only one instruction stream isbeing acted on by the CPU during any one clockcycle• Single Data: Only one data stream is being usedas input during any one clock cycle• Deterministic execution• This is the oldest and even today, the mostcommon type of computer• Examples: older generation mainframes,minicomputers and workstations; most modernday PCs.Raul Goycoolea S.Multiprocessor Programming 3116 February 2012
  • 32.Single Instruction, Single Data (SISD):Raul Goycoolea S.Multiprocessor Programming 3216 February 2012
  • 33.Single Instruction, Multiple Data(SIMD):• A type of parallel computer• Single Instruction: All processing units execute the same instruction at anygiven clock cycle• Multiple Data: Each processing unit can operate on a different data element• Best suited for specialized problems characterized by a high degree ofregularity, such as graphics/image processing.• Synchronous (lockstep) and deterministic execution• Two varieties: Processor Arrays and Vector Pipelines• Examples:• Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,ETA10• Most modern computers, particularly those with graphics processor units(GPUs) employ SIMD instructions and execution units.Raul Goycoolea S.Multiprocessor Programming 3316 February 2012
  • 34.Single Instruction, Multiple Data(SIMD):ILLIAC IV MasPar TM CM-2 Cell GPUCray X-MP Cray Y-MPRaul Goycoolea S.Multiprocessor Programming 3416 February 2012
  • 35.• A type of parallel computer• Multiple Instruction: Each processing unit operates on the dataindependently via separate instruction streams.• Single Data: A single data stream is fed into multiple processingunits.• Few actual examples of this class of parallel computer have everexisted. One is the experimental Carnegie-Mellon C.mmp computer(1971).• Some conceivable uses might be:• multiple frequency filters operating on a single signal stream• multiple cryptography algorithms attempting to crack a single codedmessage.Multiple Instruction, Single Data(MISD):Raul Goycoolea S.Multiprocessor Programming 3516 February 2012
  • 36.Multiple Instruction, Single Data(MISD):Raul Goycoolea S.Multiprocessor Programming 3616 February 2012
  • 37.• A type of parallel computer• Multiple Instruction: Every processor may be executing a differentinstruction stream• Multiple Data: Every processor may be working with a differentdata stream• Execution can be synchronous or asynchronous, deterministic ornon-deterministic• Currently, the most common type of parallel computer - mostmodern supercomputers fall into this category.• Examples: most current supercomputers, networked parallelcomputer clusters and "grids", multi-processor SMP computers,multi-core PCs.Note: many MIMD architectures also include SIMD execution sub-componentsMultiple Instruction, Multiple Data(MIMD):Raul Goycoolea S.Multiprocessor Programming 3716 February 2012
  • 38.Multiple Instruction, Multiple Data(MIMD):Raul Goycoolea S.Multiprocessor Programming 3816 February 2012
  • 39.Multiple Instruction, Multiple Data(MIMD):IBM Power HP Alphaserver Intel IA32/x64Oracle SPARC Cray XT3 Oracle Exadata/ExalogicRaul Goycoolea S.Multiprocessor Programming 3916 February 2012
  • 40.Parallel Computer Memory ArchitectureShared MemoryShared memory parallel computers vary widely, but generally have in common theability for all processors to access all memory as global address space.Multiple processors can operate independently but share the same memoryresources.Changes in a memory location effected by one processor are visible to all otherprocessors.Shared memory machines can be divided into two main classes based uponmemory access times: UMA and NUMA.Uniform Memory Access (UMA):• Most commonly represented today by Symmetric Multiprocessor (SMP) machines• Identical processorsNon-Uniform Memory Access (NUMA):• Often made by physically linking two or more SMPs• One SMP can directly access memory of another SMP40Raul Goycoolea S.Multiprocessor Programming 4016 February 2012
  • 41.Parallel Computer Memory ArchitectureShared Memory41Shared Memory (UMA) Shared Memory (NUMA)Raul Goycoolea S.Multiprocessor Programming 4116 February 2012
  • 42.Basic structure of a centralizedshared-memory multiprocessorProcessor Processor Processor ProcessorOne or morelevels of CacheOne or morelevels of CacheOne or morelevels of CacheOne or morelevels of CacheMultiple processor-cache subsystems share the same physical memory, typically connected by a bus.In larger designs, multiple buses, or even a switch may be used, but the key architectural property: uniformaccess time o all memory from all processors remains.Raul Goycoolea S.Multiprocessor Programming 4216 February 2012
  • 43.Processor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryProcessor+ CacheI/OMemoryInterconnection NetworkBasic Architecture of a DistributedMultiprocessorConsists of individual nodes containing a processor, some memory, typically some I/O, and an interface to aninterconnection network that connects all the nodes. Individual nodes may contain a small number ofprocessors, which may be interconnected by a small bus or a different interconnection technology, which is lessscalable than the global interconnection network.Raul Goycoolea S.Multiprocessor Programming 4316 February 2012
  • 44.Communicationhow do parallel operations communicate data results?Synchronizationhow are parallel operations coordinated?Resource Managementhow are a large number of parallel tasks scheduled ontofinite hardware?Scalabilityhow large a machine can be built?Issues in Parallel Machine DesignRaul Goycoolea S.Multiprocessor Programming 4416 February 2012
  • 45.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 4516 February 2012
  • 47.ExplicitImplicitHardware CompilerSuperscalarProcessorsExplicitly Parallel ArchitecturesImplicit vs. Explicit ParallelismRaul Goycoolea S.Multiprocessor Programming 4716 February 2012
  • 48.Implicit Parallelism: Superscalar ProcessorsExplicit ParallelismShared Instruction ProcessorsShared Sequencer ProcessorsShared Network ProcessorsShared Memory ProcessorsMulticore ProcessorsOutlineRaul Goycoolea S.Multiprocessor Programming 4816 February 2012
  • 49.Issue varying numbers of instructions per clockstatically scheduled––using compiler techniquesin-order executiondynamically scheduled–––––Extracting ILP by examining 100‟s of instructionsScheduling them in parallel as operands become availableRename registers to eliminate anti dependencesout-of-order executionSpeculative executionImplicit Parallelism: SuperscalarProcessorsRaul Goycoolea S.Multiprocessor Programming 4916 February 2012
  • 50.Instruction i IF ID EX WBIF ID EX WBIF ID EX WBIF ID EX WBIF ID EX WBInstruction i+1Instruction i+2Instruction i+3Instruction i+4Instruction # 1 2 3 4 5 6 7 8IF: Instruction fetchEX : ExecutionCyclesID : Instruction decodeWB : Write backPipelining ExecutionRaul Goycoolea S.Multiprocessor Programming 5016 February 2012
  • 51.Instruction type 1 2 3 4 5 6 7CyclesIntegerFloating pointIFIFIDIDEXEXWBWBIntegerFloating pointIntegerFloating pointIntegerFloating pointIFIFIDIDEXEXWBWBIFIFIDIDEXEXWBWBIFIFIDIDEXEXWBWB2-issue super-scalar machineSuper-Scalar ExecutionRaul Goycoolea S.Multiprocessor Programming 5116 February 2012
  • 52.Intrinsic data dependent (aka true dependence) on Instructions:I: add r1,r2,r3J: sub r4,r1,r3If two instructions are data dependent, they cannot executesimultaneously, be completely overlapped or execute in out-of-orderIf data dependence caused a hazard in pipeline,called a Read After Write (RAW) hazardData Dependence and HazardsRaul Goycoolea S.Multiprocessor Programming 5216 February 2012
  • 53.HW/SW must preserve program order:order instructions would execute in if executed sequentially asdetermined by original source programDependences are a property of programsImportance of the data dependencies1) indicates the possibility of a hazard2) determines order in which results must be calculated3) sets an upper bound on how much parallelism can possiblybe exploitedGoal: exploit parallelism by preserving program order onlywhere it affects the outcome of the programILP and Data Dependencies, HazardsRaul Goycoolea S.Multiprocessor Programming 5316 February 2012
  • 54.Name dependence: when 2 instructions use same register ormemory location, called a name, but no flow of data betweenthe instructions associated with that name; 2 versions ofname dependenceInstrJ writes operand before InstrIreads itI: sub r4,r1,r3J: add r1,r2,r3K: mul r6,r1,r7Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”If anti-dependence caused a hazard in the pipeline, called aWrite After Read (WAR) hazardName Dependence #1: Anti-dependeceRaul Goycoolea S.Multiprocessor Programming 5416 February 2012
  • 55.Instruction writes operand before InstrIwrites it.I: sub r1,r4,r3J: add r1,r2,r3K: mul r6,r1,r7Called an “output dependence” by compiler writers.This also results from the reuse of name “r1”If anti-dependence caused a hazard in the pipeline, called aWrite After Write (WAW) hazardInstructions involved in a name dependence can executesimultaneously if name used in instructions is changed soinstructions do not conflictRegister renaming resolves name dependence for registersRenaming can be done either by compiler or by HWName Dependence #1: OutputDependenceRaul Goycoolea S.Multiprocessor Programming 5516 February 2012
  • 56.Every instruction is control dependent on some set ofbranches, and, in general, these control dependencies mustbe preserved to preserve program orderif p1 {S1;};if p2 {S2;}S1 is control dependent on p1, and S2 is control dependenton p2 but not on p1.Control dependence need not be preservedwilling to execute instructions that should not have beenexecuted, thereby violating the control dependences, if cando so without affecting correctness of the programSpeculative ExecutionControl DependenciesRaul Goycoolea S.Multiprocessor Programming 5616 February 2012
  • 57.Greater ILP: Overcome control dependence by hardwarespeculating on outcome of branches and executingprogram as if guesses were correctSpeculation ⇒ fetch, issue, and executeinstructions as if branch predictions were alwayscorrectDynamic scheduling ⇒ only fetches and issuesinstructionsEssentially a data flow execution model: Operationsexecute as soon as their operands are availableSpeculationRaul Goycoolea S.Multiprocessor Programming 5716 February 2012
  • 58.Different predictorsBranch PredictionValue PredictionPrefetching (memory access pattern prediction)InefficientPredictions can go wrongHas to flush out wrongly predicted dataWhile not impacting performance, it consumes powerSpeculation in Rampant in ModernSuperscalarsRaul Goycoolea S.Multiprocessor Programming 5816 February 2012
  • 59.Implicit Parallelism: Superscalar ProcessorsExplicit ParallelismShared Instruction ProcessorsShared Sequencer ProcessorsShared Network ProcessorsShared Memory ProcessorsMulticore ProcessorsOutlineRaul Goycoolea S.Multiprocessor Programming 5916 February 2012
  • 60.Parallelism is exposed to softwareCompiler or ProgrammerMany different formsLoosely coupled Multiprocessors to tightly coupled VLIWExplicit Parallel ProcessorsRaul Goycoolea S.Multiprocessor Programming 6016 February 2012
  • 61.Throughput per CycleOne OperationLatency in CyclesParallelism = Throughput * LatencyTo maintain throughput T/cycle when each operation haslatency L cycles, need T*L independent operationsFor fixed parallelism:decreased latency allows increased throughputdecreased throughput allows increased latency toleranceLittle’s LawRaul Goycoolea S.Multiprocessor Programming 6116 February 2012
  • 62.TimeTimeTimeTimeData-Level Parallelism (DLP)Instruction-Level Parallelism (ILP)PipeliningThread-Level Parallelism (TLP)Types of Software ParallelismRaul Goycoolea S.Multiprocessor Programming 6216 February 2012
  • 63.PipeliningThreadParallelDataParallelInstructionParallelTranslating Parallelism TypesRaul Goycoolea S.Multiprocessor Programming 6316 February 2012
  • 64.What is a sequential program?A single thread of control that executes one instruction and when it isfinished execute the next logical instructionWhat is a concurrent program?A collection of autonomous sequential threads, executing (logically) inparallelThe implementation (i.e. execution) of a collection of threads can be:Multiprogramming– Threads multiplex their executions on a single processor.Multiprocessing– Threads multiplex their executions on a multiprocessor or a multicore systemDistributed Processing– Processes multiplex their executions on several different machinesWhat is concurrency?Raul Goycoolea S.Multiprocessor Programming 6416 February 2012
  • 65.Concurrency is not (only) parallelismInterleaved ConcurrencyLogically simultaneous processingInterleaved execution on a singleprocessorParallelismPhysically simultaneous processingRequires a multiprocessors or amulticore systemTimeTimeABCABCConcurrency and ParallelismRaul Goycoolea S.Multiprocessor Programming 6516 February 2012
  • 66.There are a lot of ways to use Concurrency inProgrammingSemaphoresBlocking & non-blocking queuesConcurrent hash mapsCopy-on-write arraysExchangersBarriersFuturesThread pool supportOther Types of SynchronizationRaul Goycoolea S.Multiprocessor Programming 6616 February 2012
  • 67.DeadlockTwo or more threads stop and wait for each otherLivelockTwo or more threads continue to execute, but makeno progress toward the ultimate goalStarvationSome thread gets deferred foreverLack of fairnessEach thread gets a turn to make progressRace ConditionSome possible interleaving of threads results in anundesired computation resultPotential Concurrency ProblemsRaul Goycoolea S.Multiprocessor Programming 6716 February 2012
  • 68.Concurrency and Parallelism are important conceptsin Computer ScienceConcurrency can simplify programmingHowever it can be very hard to understand and debugconcurrent programsParallelism is critical for high performanceFrom Supercomputers in national labs toMulticores and GPUs on your desktopConcurrency is the basis for writing parallel programsNext Lecture: How to write a Parallel ProgramParallelism ConclusionsRaul Goycoolea S.Multiprocessor Programming 6816 February 2012
  • 69.Shared memory––––Ex: Intel Core 2 Duo/QuadOne copy of data sharedamong many coresAtomicity, locking andsynchronizationessential for correctnessMany scalability issuesDistributed memory––––Ex: CellCores primarily access localmemoryExplicit data exchangebetween coresData distribution andcommunication orchestrationis essential for performanceP1 P2 P3 PnMemoryInterconnection NetworkInterconnection NetworkP1 P2 P3 PnM1 M2 M3 MnTwo primary patterns of multicore architecture designArchitecture RecapRaul Goycoolea S.Multiprocessor Programming 6916 February 2012
  • 70.Processor 1…n ask for XThere is only one place to lookCommunication throughshared variablesRace conditions possibleUse synchronization to protect from conflictsChange how data is stored to minimize synchronizationP1 P2 P3 PnMemoryxInterconnection NetworkProgramming Shared Memory ProcessorsRaul Goycoolea S.Multiprocessor Programming 7016 February 2012
  • 71.Data parallelPerform same computationbut operate on different dataA single process can forkmultiple concurrent threadsEach thread encapsulate its own execution pathEach thread has local state and shared resourcesThreads communicate through shared resourcessuch as global memoryfor (i = 0; i < 12; i++)C[i] = A[i] + B[i];i=0i=1i=2i=3i=8i=9i = 10i = 11i=4i=5i=6i=7join (barrier)fork (threads)Example of ParallelizationRaul Goycoolea S.Multiprocessor Programming 7116 February 2012
  • 72.int A[12] = {...}; int B[12] = {...}; int C[12];void add_arrays(int start){int i;for (i = start; i < start + 4; i++)C[i] = A[i] + B[i];}int main (int argc, char *argv[]){pthread_t threads_ids[3];int rc, t;for(t = 0; t < 4; t++) {rc = pthread_create(&thread_ids[t],NULL /* attributes */,add_arrays /* function */,t * 4 /* args to function */);}pthread_exit(NULL);}join (barrier)i=0i=1i=2i=3i=4i=5i=6i=7i=8i=9i = 10i = 11fork (threads)Example Parallelization with ThreadsRaul Goycoolea S.Multiprocessor Programming 7216 February 2012
  • 73.Data parallelismPerform same computationbut operate on different dataControl parallelismPerform different functionsfork (threads)join (barrier)pthread_create(/* thread id */,/* attributes */,/*/*any functionargs to function*/,*/);Types of ParallelismRaul Goycoolea S.Multiprocessor Programming 7316 February 2012
  • 74.Uniform Memory Access (UMA)Centrally located memoryAll processors are equidistant (access times)Non-Uniform Access (NUMA)Physically partitioned but accessible by allProcessors have the same address spacePlacement of data affects performanceMemory Access Latency in SharedMemory ArchitecturesRaul Goycoolea S.Multiprocessor Programming 7416 February 2012
  • 75.Coverage or extent of parallelism in algorithmGranularity of data partitioning among processorsLocality of computation and communication… so how do I parallelize my program?Summary of Parallel PerformanceFactorsRaul Goycoolea S.Multiprocessor Programming 7516 February 2012
  • 76.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 7616 February 2012
  • 78.P0Tasks Processes ProcessorsP1P2 P3p0 p1p2 p3p0 p1p2 p3PartitioningSequentialcomputationParallelprogramdecompositionassignmentorchestrationmappingCommon Steps to Create a ParallelProgram
  • 79.Identify concurrency and decide at what level toexploit itBreak up computation into tasks to be dividedamong processesTasks may become available dynamicallyNumber of tasks may vary with timeEnough tasks to keep processors busyNumber of tasks available at a time is upper bound onachievable speedupDecomposition (Amdahl’s Law)
  • 80.Specify mechanism to divide work among coreBalance work and reduce communicationStructured approaches usually work wellCode inspection or understanding of applicationWell-known design patternsAs programmers, we worry about partitioning firstIndependent of architecture or programming modelBut complexity often affect decisions!Granularity
  • 81.Computation and communication concurrencyPreserve locality of dataSchedule tasks to satisfy dependences earlyOrchestration and Mapping
  • 82.Provides a cookbook to systematically guide programmersDecompose, Assign, Orchestrate, MapCan lead to high quality solutions in some domainsProvide common vocabulary to the programming communityEach pattern has a name, providing a vocabulary fordiscussing solutionsHelps with software reusability, malleability, and modularityWritten in prescribed format to allow the reader toquickly understand the solution and its contextOtherwise, too difficult for programmers, and software will notfully exploit parallel hardwareParallel Programming by Pattern
  • 83.Berkeley architecture professorChristopher AlexanderIn 1977, patterns for cityplanning, landscaping, andarchitecture in an attempt tocapture principles for “living”designHistory
  • 85.Design Patterns: Elements of Reusable Object-Oriented Software (1995)Gang of Four (GOF): Gamma, Helm, Johnson, VlissidesCatalogue of patternsCreation, structural, behavioralPatterns in Object-OrientedProgramming
  • 86.Algorithm ExpressionFinding ConcurrencyExpose concurrent tasksAlgorithm StructureMap tasks to processes toexploit parallel architecture4 Design SpacesSoftware ConstructionSupporting StructuresCode and data structuringpatternsImplementation MechanismsLow level mechanisms usedto write parallel programsPatterns for ParallelProgramming. Mattson,Sanders, and Massingill(2005).Patterns for Parallelizing Programs
  • 87.splitfrequency encodedmacroblocksZigZagIQuantizationIDCTSaturationspatially encoded macroblocksdifferentially codedmotion vectorsMotion Vector DecodeRepeatmotion vectorsMPEG bit streamVLDmacroblocks, motion vectorsMPEG DecoderjoinMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayHere’s my algorithm, Where’s theconcurrency?
  • 88.Task decompositionIndependent coarse-grainedcomputationInherent to algorithmSequence of statements(instructions) that operatetogether as a groupCorresponds to some logicalpart of programUsually follows from the wayprogrammer thinks about aproblemjoinmotion vectorsspatially encoded macroblocksIDCTSaturationMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayHere’s my algorithm, Where’s theconcurrency?
  • 89.joinmotion vectorsSaturationspatially encoded macroblocksMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationIDCTMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatTask decompositionParallelism in the applicationData decompositionSame computation is appliedto small data chunks derivedfrom large data setHere’s my algorithm, Where’s theconcurrency?
  • 90.motion vectorsspatially encoded macroblocksMPEG Decoderfrequency encodedmacroblocksZigZagIQuantizationIDCTSaturationjoinMotionCompensationrecovered picturePicture ReorderColor ConversionDisplayMPEG bit streamVLDmacroblocks, motion vectorssplitdifferentially codedmotion vectorsMotion Vector DecodeRepeatTask decompositionParallelism in the applicationData decompositionSame computation many dataPipeline decompositionData assembly linesProducer-consumer chainsHere’s my algorithm, Where’s theconcurrency?
  • 91.Algorithms start with a good understanding of theproblem being solvedPrograms often naturally decompose into tasksTwo common decompositions are––Function calls andDistinct loop iterationsEasier to start with many tasks and later fuse them,rather than too few tasks and later try to split themGuidelines for Task Decomposition
  • 92.FlexibilityProgram design should afford flexibility in the number andsize of tasks generated––Tasks should not tied to a specific architectureFixed tasks vs. Parameterized tasksEfficiencyTasks should have enough work to amortize the cost ofcreating and managing themTasks should be sufficiently independent so that managingdependencies doesn‟t become the bottleneckSimplicityThe code has to remain readable and easy to understand,and debugGuidelines for Task Decomposition
  • 93.Data decomposition is often implied by taskdecompositionProgrammers need to address task and datadecomposition to create a parallel programWhich decomposition to start with?Data decomposition is a good starting point whenMain computation is organized around manipulation of alarge data structureSimilar operations are applied to different parts of thedata structureGuidelines for Data DecompositionRaul Goycoolea S.Multiprocessor Programming 9316 February 2012
  • 94.Array data structuresDecomposition of arrays along rows, columns, blocksRecursive data structuresExample: decomposition of trees into sub-treesproblemcomputesubproblemcomputesubproblemcomputesubproblemcomputesubproblemmergesubproblemmergesubproblemmergesolutionsubproblemsplitsubproblemsplitsplitCommon Data DecompositionsRaul Goycoolea S.Multiprocessor Programming 9416 February 2012
  • 95.FlexibilitySize and number of data chunks should support a widerange of executionsEfficiencyData chunks should generate comparable amounts ofwork (for load balancing)SimplicityComplex data compositions can get difficult to manageand debugRaul Goycoolea S.Multiprocessor Programming 9516 February 2012Guidelines for Data Decompositions
  • 96.Data is flowing through a sequence of stagesAssembly line is a good analogyWhat’s a prime example of pipeline decomposition incomputer architecture?Instruction pipeline in modern CPUsWhat’s an example pipeline you may use in your UNIX shell?Pipes in UNIX: cat foobar.c | grep bar | wcOther examplesSignal processingGraphicsZigZagIQuantizationIDCTSaturationGuidelines for Pipeline DecompositionRaul Goycoolea S.Multiprocessor Programming 9616 February 2012
  • 97.<Insert Picture Here>Program Agenda• Antecedents of Parallel Computing• Introduction to Parallel Architectures• Parallel Programming Concepts• Parallel Design Patterns• Performance & Optimization• Parallel Compilers• Actual Cases• Future of Parallel ArchitecturesRaul Goycoolea S.Multiprocessor Programming 9716 February 2012
  • 99.Coverage or extent of parallelism in algorithmAmdahl‟s LawGranularity of partitioning among processorsCommunication cost and load balancingLocality of computation and communicationCommunication between processors or betweenprocessors and their memoriesReview: Keys to Parallel Performance
  • 100.n/mBt overlap)C f (o lfrequencyof messagesoverhead permessage(at both ends)network delayper messagenumber of messagesamount of latencyhidden by concurrencywith computationtotal data sentcost induced bycontention permessagebandwidth along path(determined by network)Communication Cost Model
  • 101.synchronizationpointGet DataComputeGet DataCPU is idleMemory is idleComputeOverlapping Communication withComputation
  • 102.Computation to communication ratio limitsperformance gains from pipeliningGet DataComputeGet DataComputeWhere else to look for performance?Limits in Pipelining Communication
  • 103.Determined by program implementation andinteractions with the architectureExamples:Poor distribution of data across distributed memoriesUnnecessarily fetching data that is not usedRedundant data fetchesArtifactual Communication
  • 104.In uniprocessors, CPU communicates with memoryLoads and stores are to uniprocessors as_______ and ______ are to distributed memorymultiprocessorsHow is communication overlap enhanced inuniprocessors?Spatial localityTemporal locality“get” “put”Lessons From Uniprocessors
  • 105.CPU asks for data at address 1000Memory sends data at address 1000 … 1064Amount of data sent depends on architectureparameters such as the cache block sizeWorks well if CPU actually ends up using data from1001, 1002, …, 1064Otherwise wasted bandwidth and cache capacitySpatial Locality
  • 106.Main memory access is expensiveMemory hierarchy adds small but fast memories(caches) near the CPUMemories get bigger as distancefrom CPU increasesCPU asks for data at address 1000Memory hierarchy anticipates more accesses to sameaddress and stores a local copyWorks well if CPU actually ends up using data from 1000 overand over and over …Otherwise wasted cache capacitymainmemorycache(level 2)cache(level 1)Temporal Locality
  • 107.Data is transferred in chunks to amortizecommunication costCell: DMA gets up to 16KUsually get a contiguous chunk of memorySpatial localityComputation should exhibit good spatial localitycharacteristicsTemporal localityReorder computation to maximize use of data fetchedReducing Artifactual Costs inDistributed Memory Architectures
  • 108.Tasks mapped to execution units (threads)Threads run on individual processors (cores)finish line: sequential time + longest parallel timeTwo keys to faster executionLoad balance the work among the processorsMake execution on each processor fastersequentialparallelsequentialparallelSingle Thread Performance
  • 109.Need some way ofmeasuring performanceCoarse grainedmeasurements% gcc sample.c% time a.out2.312u 0.062s 0:02.50 94.8%% gcc sample.c –O3% time a.out1.921u 0.093s 0:02.03 99.0%… but did we learn muchabout what’s going on?#define N (1 << 23)#define T (10)#include <string.h>double a[N],b[N];void cleara(double a[N]) {int i;for (i = 0; i < N; i++) {a[i] = 0;}}int main() {double s=0,s2=0; int i,j;for (j = 0; j < T; j++) {for (i = 0; i < N; i++) {b[i] = 0;}cleara(a);memset(a,0,sizeof(a));for (i = 0; i < N; i++) {s += a[i] * b[i];s2 += a[i] * a[i] + b[i] * b[i];}}printf("s %f s2 %fn",s,s2);}record stop timerecord start timeUnderstanding Performance
  • 110.Increasingly possible to get accurate measurementsusing performance countersSpecial registers in the hardware to measure eventsInsert code to start, read, and stop counterMeasure exactly what you want, anywhere you wantCan measure communication and computation durationBut requires manual changesMonitoring nested scopes is an issueHeisenberg effect: counters can perturb execution timetimestopclear/startMeasurements Using Counters
  • 111.Event-based profilingInterrupt execution when an event counter reaches athresholdTime-based profilingInterrupt execution every t secondsWorks without modifying your codeDoes not require that you know where problem might beSupports multiple languages and programming modelsQuite efficient for appropriate sampling frequenciesDynamic Profiling
  • 112.Cycles (clock ticks)Pipeline stallsCache hitsCache missesNumber of instructionsNumber of loadsNumber of storesNumber of floating point operations…Counter Examples
  • 113.Processor utilizationCycles / Wall Clock TimeInstructions per cycleInstructions / CyclesInstructions per memory operationInstructions / Loads + StoresAverage number of instructions per load missInstructions / L1 Load MissesMemory trafficLoads + Stores * Lk Cache Line SizeBandwidth consumedLoads + Stores * Lk Cache Line Size / Wall Clock TimeMany othersCache miss rateBranch misprediction rate…Useful Derived Measurements
  • 115.GNU gprofWidely available with UNIX/Linux distributionsgcc –O2 –pg foo.c –o foo./foogprof fooHPC Toolkithttp://www.hipersoft.rice.edu/hpctoolkit/PAPIhttp://icl.cs.utk.edu/papi/VTunehttp://www.intel.com/cd/software/products/asmo-na/eng/vtune/Many othersPopular Runtime Profiling Tools
  • 116.Instruction level parallelismMultiple functional units, deeply pipelined, speculation, ...Data level parallelismSIMD (Single Inst, Multiple Data): short vector instructions(multimedia extensions)–––Hardware is simpler, no heavily ported register filesInstructions are more compactReduces instruction fetch bandwidthComplex memory hierarchiesMultiple level caches, may outstanding misses,prefetching, …Performance un Uniprocessorstime = compute + wait
  • 117.Single Instruction, Multiple DataSIMD registers hold short vectorsInstruction operates on all elements in SIMD register at onceabcVector codefor (int i = 0; i < n; i += 4) {c[i:i+3] = a[i:i+3] + b[i:i+3]}SIMD registerScalar codefor (int i = 0; i < n; i+=1) {c[i] = a[i] + b[i]}abcscalar registerSingle Instruction, Multiple Data
  • 118.For Example CellSPU has 128 128-bit registersAll instructions are SIMD instructionsRegisters are treated as short vectors of 8/16/32-bitintegers or single/double-precision floatsInstruction SetAltiVecMMX/SSE3DNow!VISMAX2MVIMDMXArchitecturePowerPCIntelAMDSunHPAlphaMIPS VSIMD Width12864/1286464646464Floating PointyesyesyesnononoyesSIMD in Major Instruction SetArchitectures (ISAs)
  • 119.Library calls and inline assemblyDifficult to programNot portableDifferent extensions to the same ISAMMX and SSESSE vs. 3DNow!Compiler vs. Crypto Oracle T4Using SIMD Instructions
  • 120.Tune the parallelism firstThen tune performance on individual processorsModern processors are complexNeed instruction level parallelism for performanceUnderstanding performance requires a lot of probingOptimize for the memory hierarchyMemory is much slower than processorsMulti-layer memory hierarchies try to hide the speed gapData locality is essential for performanceProgramming for Performance
• 121. Programming for Performance (continued)
You may have to change everything: algorithms, data structures, program structure. Focus on the biggest performance impediments; there are too many issues to study everything, and remember the law of diminishing returns.
• 122. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Actual Cases
• Future of Parallel Architectures
Raul Goycoolea S. Multiprocessor Programming 122 16 February 2012
• 124. Compilers Outline: Parallel Execution, Parallelizing Compilers, Dependence Analysis, Increasing Parallelization Opportunities, Generation of Parallel Loops, Communication Code Generation. Raul Goycoolea S. Multiprocessor Programming 124 16 February 2012
• 125. Types of Parallelism
    Instruction Level Parallelism (ILP)               - Scheduling and Hardware
    Task Level Parallelism (TLP)                      - Mainly by hand
    Loop Level Parallelism (LLP) or Data Parallelism  - Hand or Compiler Generated
    Pipeline Parallelism                              - Hardware or Streaming
    Divide and Conquer Parallelism                    - Recursive functions
Raul Goycoolea S. Multiprocessor Programming 125 16 February 2012
• 126. Why Loops?
90% of the execution time is in 10% of the code, mostly in loops. If a loop is parallel, we can get good performance and good load balancing, and loops are relatively easy to analyze. Raul Goycoolea S. Multiprocessor Programming 126 16 February 2012
• 127. Programmer Defined Parallel Loop
FORALL: no "loop carried dependences", fully parallel.
FORACROSS: some "loop carried dependences".
Raul Goycoolea S. Multiprocessor Programming 127 16 February 2012
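One concrete way to write a FORALL-style loop is OpenMP's parallel for; this is a minimal sketch under that assumption (the deck does not prescribe a particular API):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double A[N], B[N];

        /* FORALL: every iteration is independent, so the runtime may
           run them in any order, on any number of threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] = B[i] * 2.0 + 1.0;

        printf("A[0] = %f\n", A[0]);
        return 0;
    }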
• 128. Outline: Parallel Execution, Parallelizing Compilers, Dependence Analysis, Increasing Parallelization Opportunities, Generation of Parallel Loops, Communication Code Generation. Raul Goycoolea S. Multiprocessor Programming 128 16 February 2012
• 129. Parallelizing Compilers
The goal is finding FORALL loops among ordinary FOR loops. Examples:
    FOR I = 0 to 5
      A[I+1] = A[I] + 1
    FOR I = 0 to 5
      A[I] = A[I+6] + 1
    FOR I = 0 to 5
      A[2*I] = A[2*I + 1] + 1
(The first loop carries a dependence from iteration I to iteration I+1, so it is not parallel; in the second and third, no iteration reads a location another iteration writes, so they are FORALL loops.)
Raul Goycoolea S. Multiprocessor Programming 129 16 February 2012
• 130. Dependences
True dependence:   a = ...   followed by   ... = a
Anti dependence:   ... = a   followed by   a = ...
Output dependence: a = ...   followed by   a = ...
Definition: a data dependence exists between dynamic instances i and j iff either i or j is a write operation, i and j refer to the same variable, and i executes before j.
How about array accesses within loops?
Raul Goycoolea S. Multiprocessor Programming 130 16 February 2012
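A tiny straight-line example of the three dependence kinds (illustrative code, not from the deck):

    /* Illustrating the three dependence types between statements S1..S4. */
    void deps(void) {
        int a, b, c, t;

        a = 1;       /* S1: write a                                     */
        b = a + 1;   /* S2: read a  -> true (flow) dependence S1 -> S2  */
        c = b;       /* S3: read b                                      */
        b = 7;       /* S4: write b -> anti dependence S3 -> S4,        */
                     /*                output dependence S2 -> S4       */
        t = c + b;   /* keep the values live so the compiler keeps them */
        (void)t;
    }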
• 131. Outline: Parallel Execution, Parallelizing Compilers, Dependence Analysis, Increasing Parallelization Opportunities, Generation of Parallel Loops, Communication Code Generation. Raul Goycoolea S. Multiprocessor Programming 131 16 February 2012
• 132. Array Access in a Loop
    FOR I = 0 to 5
      A[I] = A[I] + 1
[Diagram: the iteration space (I = 0 ... 5) mapped onto the data space (the elements of A); each iteration I reads and writes only A[I], so no two iterations touch the same element.]
Raul Goycoolea S. Multiprocessor Programming 132 16 February 2012
• 133. Recognizing FORALL Loops
Find the data dependences in the loop. For every pair of array accesses to the same array: if the first access has at least one dynamic instance (an iteration) in which it refers to a location in the array that the second access also refers to in at least one of the later dynamic instances (iterations), then there is a data dependence between the statements. (Note that an access can depend on itself: output dependences.)
Definition: a loop-carried dependence is a dependence that crosses a loop boundary.
If there are no loop-carried dependences, the loop is parallelizable.
Raul Goycoolea S. Multiprocessor Programming 133 16 February 2012
• 134. What is the Dependence?
    FOR I = 1 to n
      FOR J = 1 to n
        A[I, J] = A[I-1, J+1] + 1

    FOR I = 1 to n
      FOR J = 1 to n
        A[I] = A[I-1] + 1
[Diagram: the dependences plotted in the (I, J) iteration space for each loop nest.]
(In the first nest the dependence goes from iteration (I-1, J+1) to (I, J), so it is carried by the outer I loop and the inner J loop is parallel. In the second nest the dependence A[I-1] to A[I] is carried by the I loop, and every J iteration for a given I writes the same element A[I].)
Raul Goycoolea S. Multiprocessor Programming 134 16 February 2012
• 135. Outline: Parallel Execution, Parallelizing Compilers, Dependence Analysis, Increasing Parallelization Opportunities, Generation of Parallel Loops, Communication Code Generation. Raul Goycoolea S. Multiprocessor Programming 135 16 February 2012
• 136. Increasing Parallelization Opportunities
Scalar Privatization
Reduction Recognition
Induction Variable Identification
Array Privatization
Interprocedural Parallelization
Loop Transformations
Granularity of Parallelism
Raul Goycoolea S. Multiprocessor Programming 136 16 February 2012
• 137. Scalar Privatization
Example:
    FOR i = 1 to n
      X = A[i] * 3;
      B[i] = X;
Is there a loop-carried dependence? What is the type of dependence? (Every iteration writes the shared scalar X, so there are loop-carried anti and output dependences on X; they disappear if each iteration gets its own private copy of X.)
Raul Goycoolea S. Multiprocessor Programming 137 16 February 2012
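A minimal sketch of scalar privatization using OpenMP's private clause (an assumption of convenience; the deck does not mandate OpenMP):

    #include <omp.h>

    void scale(const double *A, double *B, int n) {
        double X;

        /* Each thread gets its own copy of X, so the loop-carried
           anti/output dependences on X disappear. */
        #pragma omp parallel for private(X)
        for (int i = 0; i < n; i++) {
            X = A[i] * 3.0;
            B[i] = X;
        }
    }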
• 138. Reduction Recognition
Reduction analysis applies when only associative operations are used and the result is never used within the loop. Transformation (per-processor partial sums):
    Integer Xtmp[NUMPROC];
    Barrier();
    FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)
      Xtmp[myPid] = Xtmp[myPid] + A[i];
    Barrier();
    If (myPid == 0) {
      FOR p = 0 to NUMPROC-1
        X = X + Xtmp[p];
      ...
    }
Raul Goycoolea S. Multiprocessor Programming 138 16 February 2012
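The same pattern in compact form, as a hedged OpenMP sketch: the reduction clause makes the runtime create the per-thread partial sums and combine them, much like the Xtmp array above.

    #include <omp.h>

    double sum(const double *A, int n) {
        double X = 0.0;

        /* Each thread accumulates into a private copy of X; the copies
           are added together when the parallel loop finishes. */
        #pragma omp parallel for reduction(+:X)
        for (int i = 0; i < n; i++)
            X += A[i];

        return X;
    }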
• 139. Induction Variables
Example:
    FOR i = 0 to N
      A[i] = 2^i;
After strength reduction:
    t = 1
    FOR i = 0 to N
      A[i] = t;
      t = t*2;
What happened to the loop-carried dependences? Strength reduction introduced one: each iteration now depends on the previous value of t. To parallelize, we need to do the opposite: perform induction variable analysis and rewrite induction variables as a function of the loop variable.
Raul Goycoolea S. Multiprocessor Programming 139 16 February 2012
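A small illustrative sketch (not from the deck) of rewriting the induction variable as a function of the loop variable, which makes the loop a FORALL again; ldexp(1.0, i) computes 2^i without carrying state across iterations:

    #include <math.h>
    #include <omp.h>

    void powers_of_two(double *A, int n) {
        /* t has been eliminated: each iteration computes 2^i directly,
           so there is no loop-carried dependence left. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            A[i] = ldexp(1.0, i);   /* 1.0 * 2^i */
    }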
• 140. Array Privatization
Similar to scalar privatization, but the analysis is more complex.
Array data dependence analysis checks whether two iterations access the same location; array data flow analysis checks whether two iterations access the same value.
The transformations are similar to scalar privatization: give each processor a private copy of the array, or expand the array with an additional dimension.
Raul Goycoolea S. Multiprocessor Programming 140 16 February 2012
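A hedged sketch of the per-processor copy idea (illustrative only): a temporary array that is written and read entirely within one iteration can safely get one copy per thread.

    #include <omp.h>

    #define N 1024
    #define M 64

    void rowsum(double out[N], double in[N][M]) {
        #pragma omp parallel
        {
            /* One private copy of the temporary per thread: this replaces
               what would otherwise be a shared tmp[M]. */
            double tmp[M];

            #pragma omp for
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < M; j++)
                    tmp[j] = in[i][j] * 2.0;
                double s = 0;
                for (int j = 0; j < M; j++)
                    s += tmp[j];
                out[i] = s;
            }
        }
    }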
• 141. Interprocedural Parallelization
Function calls can make a loop unparallelizable, reducing the available parallelism; much of what remains is inner-loop parallelism.
Solutions: interprocedural analysis, or inlining.
Raul Goycoolea S. Multiprocessor Programming 141 16 February 2012
• 142. Communication Code Generation
Cache-coherent shared memory machine: generate code for the parallel loop nest.
No cache-coherent shared memory, or distributed memory machines: generate code for the parallel loop nest, identify the communication, and generate the communication code.
Raul Goycoolea S. Multiprocessor Programming 142 16 February 2012
• 143. Communication Optimizations
Eliminating redundant communication
Communication aggregation
Multi-cast identification
Local memory management
Raul Goycoolea S. Multiprocessor Programming 143 16 February 2012
• 144. Summary
Automatic parallelization of loops with arrays requires data dependence analysis. With the iteration space and data space abstractions this becomes an integer programming problem. Many optimizations can increase parallelism, and transforming loop nests and generating communication code fit into the same framework: Fourier-Motzkin elimination provides a nice foundation.
Raul Goycoolea S. Multiprocessor Programming 144 16 February 2012
• 145. Program Agenda
• Antecedents of Parallel Computing
• Introduction to Parallel Architectures
• Parallel Programming Concepts
• Parallel Design Patterns
• Performance & Optimization
• Parallel Compilers
• Future of Parallel Architectures
Raul Goycoolea S. Multiprocessor Programming 145 16 February 2012
• 147. Predicting the Future is Always Risky
"I think there is a world market for maybe five computers."
  – Thomas Watson, chairman of IBM, 1949
"There is no reason in the world anyone would want a computer in their home. No reason."
  – Ken Olsen, Chairman, DEC, 1977
"640K of RAM ought to be enough for anybody."
  – Bill Gates, 1981
Raul Goycoolea S. Multiprocessor Programming 147 16 February 2012
• 148. Future = Evolution + Revolution
Evolution is relatively easy to predict: extrapolate the trends.
Revolution, a completely new technology or solution, is hard to predict.
Paradigm shifts can occur in both.
Raul Goycoolea S. Multiprocessor Programming 148 16 February 2012
• 149. Outline
Evolution: trends; architecture; languages, compilers and tools.
Revolution: crossing the abstraction boundaries.
Raul Goycoolea S. Multiprocessor Programming 149 16 February 2012
• 150. Evolution
Look at the trends: Moore's Law, power consumption, wire delay, hardware complexity, parallelizing compilers, program design methodologies. The design drivers are different in different generations.
Raul Goycoolea S. Multiprocessor Programming 150 16 February 2012
• 151. The Road to Multicore: Moore's Law
[Chart: processor performance (relative to the VAX-11/780) and transistor counts, 1978-2016, for the 8086, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2; transistor counts climb from tens of thousands toward a billion, with performance growth-rate annotations of 25%/year and 52%/year marking different eras. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
Raul Goycoolea S. Multiprocessor Programming 151 16 February 2012
• 152. The Road to Multicore: Uniprocessor Performance (SPECint)
[Chart: SPECint2000 performance, 1985-2007, for Intel (386, 486, Pentium through Pentium 4, Itanium), Alpha (21064, 21164, 21264), Sparc/SuperSparc/Sparc64, MIPS, HP PA, PowerPC and AMD (K6, K7, x86-64) processors.]
Raul Goycoolea S. Multiprocessor Programming 152 16 February 2012
• 153. The Road to Multicore: Uniprocessor Performance (SPECint)
General-purpose unicores have stopped their historic performance scaling, due to power consumption, wire delays, DRAM access latency, and the diminishing returns of more instruction-level parallelism.
Raul Goycoolea S. Multiprocessor Programming 153 16 February 2012
• 154. Power Consumption (watts)
[Chart: power consumption in watts, 1985-2007, for Intel (386 through Pentium 4 and Itanium), Alpha, Sparc, MIPS, HP PA, PowerPC and AMD processors.]
Raul Goycoolea S. Multiprocessor Programming 154 16 February 2012
• 155. Power Efficiency (watts/spec)
[Chart: watts per SPEC, 1982-2006, for the same set of Intel, Alpha, Sparc, MIPS, HP PA, PowerPC and AMD processors.]
Raul Goycoolea S. Multiprocessor Programming 155 16 February 2012
• 156. Range of a Wire in One Clock Cycle
[Chart: for process technologies from 0.25 microns down to about 0.06 microns (1996-2014) and clock rates from 700 MHz to 13.5 GHz, the fraction of a 400 mm2 die reachable by a wire in one clock cycle. From the SIA Roadmap.]
Raul Goycoolea S. Multiprocessor Programming 156 16 February 2012
• 157. DRAM Access Latency
[Chart: processor performance improving at about 60%/year (2x every 1.5 years) versus DRAM at about 9%/year (2x every 10 years), 1980-2004.]
Access times are a speed-of-light issue, and memory technology is also changing: SRAM is getting harder to scale, and DRAM is no longer the cheapest cost/bit. Power efficiency is an issue here as well.
Raul Goycoolea S. Multiprocessor Programming 157 16 February 2012
• 158. CPUs Architecture: Heat Becoming an Unmanageable Problem
[Chart: power density (W/cm2) from the 4004 and 8008 through the 8080, 8086, 286, 386, 486 and Pentium, heading past "hot plate" toward nuclear-reactor, rocket-nozzle and Sun's-surface levels. Source: Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W).]
There is a cube relationship between cycle time and power.
Raul Goycoolea S. Multiprocessor Programming 158 16 February 2012
• 159. Improvement in Automatic Parallelization
[Timeline, 1970-2010: automatic parallelizing compilers for FORTRAN and vectorization technology; then compiling for instruction-level parallelism; a dip with the prevalence of type-unsafe languages and complex data structures (C, C++); renewed progress with type-safe languages (Java, C#); and, going forward, demand driven by multicores?]
Raul Goycoolea S. Multiprocessor Programming 159 16 February 2012
• 160. Multicores Future
[Chart: number of cores per chip, 1970-2010, on a scale from 1 to 512. Single-core parts run from the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PExtreme, Power6, Yonah, PA-8800, Opteron, Tanglewood and Xeon MP; multicore parts such as Niagara, Cell, Raw, Xbox360, Opteron 4P, Broadcom 1480, Cisco CSR-1, Intel Tflops, Raza XLR, Cavium Octeon, PicoChip PC102 and Ambric AM2045 push core counts toward the hundreds.]
Raul Goycoolea S. Multiprocessor Programming 160 16 February 2012
• 161. Outline
Evolution: trends; architecture; languages, compilers and tools.
Revolution: crossing the abstraction boundaries.
Raul Goycoolea S. Multiprocessor Programming 161 16 February 2012
• 162. Novel Opportunities in Multicores
We don't have to contend with uniprocessors: the era of Moore's Law induced performance gains is over. Parallel programming will be required by the masses, not just a few supercomputer super-users.
Raul Goycoolea S. Multiprocessor Programming 162 16 February 2012
• 163. Novel Opportunities in Multicores (continued)
This is also not your same old multiprocessor problem. How does going from multiprocessors to multicores impact programs? What changed? Where is the impact? In communication bandwidth and communication latency.
Raul Goycoolea S. Multiprocessor Programming 163 16 February 2012
• 164. Communication Bandwidth
How much data can be communicated between two cores? Roughly 32 gigabits/sec off chip versus about 300 terabits/sec on chip: about 10,000x.
What changed? The number of wires (IO is the true bottleneck, while on-chip wire density is very high), the clock rate (IO is slower than on-chip), and multiplexing (no sharing of pins on chip).
Impact on the programming model: massive data exchange is possible, data movement is not the bottleneck, and processor affinity is not that important.
Raul Goycoolea S. Multiprocessor Programming 164 16 February 2012
• 165. Communication Latency
How long does a round-trip communication take? Roughly 200 cycles off chip versus about 4 cycles on chip: about 50x.
What changed? The length of the wire (very short wires are faster) and the pipeline stages (no multiplexing, on-chip is much closer, bypass and speculation become possible).
Impact on the programming model: ultra-fast synchronization, and real-time applications can run across multiple cores.
Raul Goycoolea S. Multiprocessor Programming 165 16 February 2012
• 166. Past, Present and the Future?
[Diagram: a traditional multiprocessor (each processing element with its own cache and memory), a basic multicore such as IBM Power (processing elements and caches sharing memory on one chip), and an integrated multicore such as the 8-core, 8-thread Oracle T4.]
Raul Goycoolea S. Multiprocessor Programming 166 16 February 2012
• 167. Summary
• As technology evolves, the inherent flexibility of multiprocessors adapts to new requirements
• Processors can be used at any time for many kinds of applications
• Optimization adapts processors to high performance requirements
Raul Goycoolea S. Multiprocessor Programming 167 16 February 2012
• 168. References
• Author: Raul Goycoolea, Oracle Corporation.
• A search on the WWW for "parallel programming" or "parallel computing" will yield a wide variety of information.
• Recommended reading:
  • "Designing and Building Parallel Programs", Ian Foster. http://www-unix.mcs.anl.gov/dbpp/
  • "Introduction to Parallel Computing", Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. http://www-users.cs.umn.edu/~karypis/parbook/
  • "Overview of Recent Supercomputers", A.J. van der Steen, Jack Dongarra. www.phys.uu.nl/~steen/web03/overview.html
• MIT Multicore Programming Class: 6.189, Prof. Saman Amarasinghe.
• Photos/Graphics have been created by the author, obtained from non-copyrighted, government or public domain (such as http://commons.wikimedia.org/) sources, or used with the permission of authors from other presentations and web pages.
• 169. Keep in Touch
Raul Goycoolea Seoane
Twitter: http://twitter.com/raul_goycoolea
Facebook: http://www.facebook.com/raul.goycoolea
Linkedin: http://www.linkedin.com/in/raulgoy
Blog: http://blogs.oracle.com/raulgoy/
Raul Goycoolea S. Multiprocessor Programming 169 16 February 2012
