A Tutorial on High Performance Computing Taxonomy


Presentation Transcript

1

A Tutorial on High Performance Computing Taxonomy
By Prof. V. Kamakoti
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Chennai – 600 036, India

2

Organization of the Tutorial
Session – 1
Instruction Level Parallelism (ILP)
Pipelining concepts
RTL and Speed-up
Superscalar/VLIW concepts
Static Instruction Scheduling
Dynamic Instruction Scheduling
Branch Prediction

3

Organization of the Tutorial
Session – 2
Amdahl’s law and its applications
Symmetric Multiprocessors (SMP)
The Cache Coherency problem
ESI Protocol
Distributed Memory Systems
Basics of Message Passing Systems
Parallel Models of Computing
Design of Algorithms for Parallel Processors
Brent’s Lemma

4

Why this Title?
Performance-related issues at:
circuit level (RTL)
instruction level (processor level)
Shared Memory Multiprocessor level (SMP)
Distributed Memory Multiprocessor level (Cluster/Grid – a collection of SMPs)

5

ILP - Pipelining
The pipeline stages: Fetch + Inc. PC | Decode Instrn | Fetch Data | Execute Instrn | Store Data.
10000 instructions with no pipelining take 50000 units.
With pipelining, the stages overlap:
Unit 1: I1
Unit 2: I2 I1
Unit 3: I3 I2 I1
Unit 4: I4 I3 I2 I1
Unit 5: I5 I4 I3 I2 I1
The first instruction completes at the end of the 5th unit, the second instruction at the end of the 6th unit, and the 10000th instruction at the end of 10004 units.
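For the record, these numbers follow from the standard pipeline-timing argument, which the slide uses implicitly: with k stages of one unit each, the first instruction takes k units and every later instruction completes one unit after its predecessor.

```latex
\[
  T_{\text{seq}} = nk, \qquad
  T_{\text{pipe}} = k + (n - 1), \qquad
  S = \frac{nk}{k + n - 1}
    = \frac{10000 \times 5}{5 + 9999}
    = \frac{50000}{10004} \approx 5
\]
```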

6

Performance
With pipelining we get a speed-up of close to 5. This will not always work:
Hazards: data, control, structural.
Non-ideal stages: not every step takes the same amount of time, e.g. Float Mult – 10 cycles, Float Div – 40 cycles.
Performance comes out of parallelism. Let us try to understand parallelism.

7

Types of Parallelism
Recognizing parallelism. Example: to add 100 numbers:
for j = 1 to 100 do Sum = Sum + A[j]; // inherently sequential
A better solution in terms of parallelism, assuming 4 processors are available (sketched in code below):
Split the 100 numbers into 4 parts of 25 numbers each and allot one part to each of the four processors.
Each processor adds the 25 numbers allotted to it and sends its partial sum to a head processor, which adds the partial sums and gives the final answer.
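A minimal sketch of this scheme in Python; multiprocessing.Pool stands in for the four processors, and the function and variable names are ours:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each "processor" adds the 25 numbers allotted to it.
    return sum(chunk)

if __name__ == "__main__":
    A = list(range(1, 101))                            # 100 numbers to add
    chunks = [A[i:i + 25] for i in range(0, 100, 25)]  # 4 parts of 25 numbers
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)       # same operation, different data
    print(sum(partials))                               # head processor combines: 5050
```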

8

Types of Parallelism
Data Parallelism, or SIMD: the above example. The same instruction, “add 25 numbers”, but on multiple data. The parallelism comes from the data.

9

Types of Parallelism
Functional Parallelism, or MIMD: multiple functions to be performed on a data set or data sets. Multiple Instructions, Multiple Data. An example is the pipeline discussed earlier.

10

Example of Pipelining
Imagine that 100 sets of data are to be processed in sequence by the following system:
[Figure: data flows through Part 1 and then Part 2]
Part 1 takes 10 ms and Part 2 takes 15 ms. To process 100 sets of data, it takes 2500 ms.

11

Example of Pipelining
Consider the following change, with a storage element between the two parts. When the first data set is in Part 2, the second data set can be in Part 1.
[Figure: Part 1 → STORAGE → Part 2]
The first data set finishes at 30 ms, and after every 15 ms one data set comes out – total processing time is 1515 ms. A tremendous speedup.
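The 30 ms and 1515 ms follow from the usual clocked-pipeline accounting, which the slide leaves implicit: once the storage element is clocked, every stage effectively takes as long as the slowest stage (15 ms).

```latex
\[
  T_{\text{pipe}} = (n + k - 1)\,d_{\max} = (100 + 2 - 1) \times 15 = 1515\ \text{ms}
  \qquad\text{vs.}\qquad
  T_{\text{seq}} = n\,(d_1 + d_2) = 100 \times 25 = 2500\ \text{ms}
\]
```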

12

Functional Parallelism
Different data sets and different instructions operating on them. Hence, Multiple Instruction, Multiple Data. An interesting problem is to convert circuits with large delays into pipelined circuits to get very good throughput, as seen earlier. This is useful when the same circuit is used for different sets of data.

13

Pipelining Concepts
A combinational circuit can easily be modeled as a Directed Acyclic Graph (DAG). Every node of the DAG is a subcircuit of the given circuit and forms a stage of the pipeline. An edge of the DAG connects two nodes of the DAG. Perform a topological sort of the DAG to assign a level to every node.

14

[Figure: a DAG with nodes N1–N8 placed at levels 1 to 4 by the topological sort]

15

A Pipelined Circuit
If an edge connects two nodes at levels j and k, j < k, then introduce k − j storage levels on that edge. Each edge can carry one or more bits. (A code sketch of the whole procedure follows the next figure.)

16

[Figure: the same DAG with storage elements inserted; an edge from level j to level k carries k − j storage stages]
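A small sketch of the recipe in Python; the predecessor-map encoding of the DAG and the function name are ours, and graphlib is in the standard library from Python 3.9:

```python
from graphlib import TopologicalSorter

def pipeline_registers(dag):
    """dag maps each node to the set of its predecessors.
    Returns (level per node, storage elements per edge)."""
    level = {}
    for node in TopologicalSorter(dag).static_order():
        preds = dag.get(node, set())
        # A node sits one level past its deepest predecessor.
        level[node] = 1 + max((level[p] for p in preds), default=0)
    # An edge from level j to level k needs k - j storage stages.
    regs = {(p, n): level[n] - level[p]
            for n in dag for p in dag.get(n, set())}
    return level, regs

# A toy DAG (not the figure's exact wiring): N1 -> N2 -> N4, and N1 -> N4.
levels, regs = pipeline_registers({"N2": {"N1"}, "N4": {"N1", "N2"}})
print(levels)  # {'N1': 1, 'N2': 2, 'N4': 3}
print(regs)    # ('N1','N2'): 1, ('N2','N4'): 1, ('N1','N4'): 2 (order may vary)
```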

17

Optimization
The delay at every stage should be almost equal: the stage with the maximum delay dictates the throughput. The number of bits transferred across nodes should also be optimized, which reduces the storage requirements.

18

Stage Time Balancing
If Part 1 takes 10 ms and Part 2 takes 15 ms, the first data set finishes at 30 ms and after every 15 ms one data set comes out – total processing time for 100 data sets is 1515 ms. A tremendous speedup.
If Part 1 takes 12 ms and Part 2 takes 13 ms, the first data set finishes at 26 ms and after every 13 ms one data set comes out – total processing time for 100 data sets is 1313 ms. A further significant improvement.
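Both totals come from the same clocked-pipeline formula used above; balancing the stages shrinks the maximum stage delay:

```latex
\[
  (100 + 2 - 1) \times 15 = 1515\ \text{ms}
  \qquad\longrightarrow\qquad
  (100 + 2 - 1) \times 13 = 1313\ \text{ms}
\]
```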

19

RTL View of Circuits and Performance
Register Transfer Level:
[Figure: Reg L1 → combinational logic (3 ns) → Reg L2 → combinational logic (5 ns) → Reg L3]
Clock frequency is (1/5) × 10^9 Hz. Improve the frequency by reducing the maximum stage delay.
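Spelled out, since the clock must accommodate the slowest register-to-register path:

```latex
\[
  f = \frac{1}{\max_i t_i} = \frac{1}{5\ \text{ns}} = 200\ \text{MHz}
\]
```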

20

High Speed Circuits
Carry-ripple to carry-lookahead adders; Wallace-tree multipliers.
Increased area and power consumption; lower time delays.
Why do laptops have lower frequency ratings than desktops?
Reference: Cormen, Leiserson and Rivest, Introduction to Algorithms (first edition), or Computer Architecture by Hamacher et al.

21

ILP Continues…
Data Hazards:
LOAD [R2 + 10], R1  // loads into R1
ADD R3, R1, R2      // R3 = R1 + R2
This is the “Read After Write (RAW)” data hazard for R1.
LD [R2 + 10], R1
ADD R3, R1, R12
LD [R2 + 14], R1
ADD R12, R1, R2
This shows the WAW hazard for R1 (both loads write it) and the WAR hazard for R12 (read by the first ADD, written by the last).
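A tiny checker that makes the three definitions concrete; the (writes, reads) encoding of an instruction is our own illustrative choice:

```python
def hazards(first, second):
    """Each instruction is a pair (regs_written, regs_read);
    `second` comes later in program order."""
    w1, r1 = first
    w2, r2 = second
    found = []
    if w1 & r2:
        found.append("RAW")  # second reads what first wrote
    if r1 & w2:
        found.append("WAR")  # second writes what first read
    if w1 & w2:
        found.append("WAW")  # both write the same register
    return found

# LOAD [R2+10], R1 followed by ADD R3, R1, R2 -> RAW on R1
print(hazards(({"R1"}, {"R2"}), ({"R3"}, {"R1", "R2"})))  # ['RAW']
```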

22

ILP – Pipelining Advanced
[Figure: Fetch + Inc. PC → Decode Instrn → Fetch Data → Execute Units 1 … K → Store Data]
Superscalar: CPI < 1.
Success: different instructions take different cycle times – four FMULs can complete while one FDIV runs. This implies out-of-order execution.

23

Difficulties in Superscalar Construction
Ensuring there are no data hazards among the several instructions executing in the different execution units at the same point in time.
If this is done by the compiler – Static Instruction Scheduling – VLIW – Itanium.
If it is done by the hardware – Dynamic Instruction Scheduling – Tomasulo – MIPS embedded processors.

24

Static Instruction Scheduling
The compiler makes bundles of K instructions that can be issued to the execution units at the same time, such that there are no data dependencies between them (a toy bundler is sketched below).
A Very Long Instruction Word (VLIW) accommodates K instructions at a time.
Lots of NOPs if a bundle cannot be filled with relevant instructions – this inflates the size of the executable.
Does not complicate the hardware.
Source-code portability – if I make the next-generation processor with K+5 units (say), then what? Solved by having a software/firmware emulator, which has a negative impact on performance.
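A toy greedy bundler in the spirit of the slide, reusing the (writes, reads) encoding from above; real VLIW compilers do far more (list scheduling, latencies, functional-unit constraints):

```python
def bundle(instrs, k):
    """Greedily pack instructions into issue bundles of width k,
    preserving program order and allowing no dependences inside a bundle."""
    bundles, current = [], []
    for w, r in instrs:
        conflict = any(pw & r or pr & w or pw & w for pw, pr in current)
        if conflict or len(current) == k:
            bundles.append(current + [None] * (k - len(current)))  # pad with NOPs
            current = []
        current.append((w, r))
    if current:
        bundles.append(current + [None] * (k - len(current)))
    return bundles

# Two dependent instructions land in separate bundles, each padded with a NOP (None):
print(bundle([({"R1"}, {"R2"}), ({"R3"}, {"R1", "R2"})], k=2))
```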

25

Thorn in the Flesh for Static Instruction Scheduling
The famous “Memory Aliasing Problem”:
ST R3, [R4 + 40]  // store R3 to memory[R4 + 40]
LD [R1 + 20], R2  // load R2 from memory[R1 + 20]
This implies a RAW dependence if R4 + 40 = R1 + 20, and this cannot be detected at compile time. Such combinations of memory operations are not put in the same bundle, and memory operations are strictly scheduled in program order.
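The point in two lines (the register values are ours, purely for illustration): whether the two effective addresses collide depends on values that exist only at run time.

```python
def may_alias(r1, r4):
    # The compiler sees only "R1 + 20" and "R4 + 40"; the processor sees numbers.
    return r1 + 20 == r4 + 40

print(may_alias(100, 80))   # True  -> store and load touch the same address
print(may_alias(100, 200))  # False -> independent, safe to reorder
```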

26

Dynamic Instruction Scheduling
The data hazards are handled by the hardware:
RAW – using the operand forwarding technique.
WAR and WAW – using the register renaming technique.
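A minimal sketch of register renaming, applied to the WAW/WAR example from the earlier slide; the map-table scheme is the textbook idea and the physical-register names are ours. Each architectural destination gets a fresh physical register, so only true RAW dependences remain.

```python
from itertools import count

def rename(instrs):
    """instrs: list of (dest, src1, src2, ...) architectural register names.
    Returns the same program over fresh physical registers."""
    fresh = (f"P{i}" for i in count())
    table = {}          # architectural register -> current physical register
    out = []
    for dest, *srcs in instrs:
        # Sources read the latest physical name (preserves RAW).
        phys_srcs = [table.get(s, s) for s in srcs]
        # A fresh destination removes WAR and WAW on `dest`.
        table[dest] = next(fresh)
        out.append((table[dest], *phys_srcs))
    return out

# The earlier WAW (on R1) / WAR (on R12) sequence, as (dest, sources) tuples:
prog = [("R1", "R2"), ("R3", "R1", "R12"), ("R1", "R2"), ("R12", "R1", "R2")]
print(rename(prog))
# [('P0', 'R2'), ('P1', 'P0', 'R12'), ('P2', 'R2'), ('P3', 'P2', 'R2')]
```

After renaming, the two writes to R1 go to distinct registers P0 and P2, and the late write to R12 goes to P3, so the only remaining ordering constraints are the true RAW dependences.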
