Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Dr. C.V. Suresh Babu
Multicores Are Here!

[Chart: number of cores (1, 2, 4, ... 512 on a log scale) versus year of introduction (1970 to 20??). Uniprocessors such as the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, PA-8800, Power4, Power6, PExtreme and Yonah sit at one core; multicores include the Raza XLR, Niagara, Broadcom 1480, Raw, Cell, Opteron 4P, Xeon MP, Xbox360, Tanglewood, Cavium Octeon, Intel Tflops, Cisco CSR-1, Ambric AM2045 and Picochip PC102, reaching into the hundreds of cores.]
Multicores Are Here!
For uniprocessors, C was:
• Portable
• High Performance
• Composable
• Malleable
• Maintainable
Uniprocessors: C is the common machine language.

[Same cores-versus-year chart as above.]
Multicores Are Here!
What is the common machine language for multicores?

[Same cores-versus-year chart as above.]
Common Machine Languages

Uniprocessors, common properties:
• Single flow of control
• Single memory image
Uniprocessors, differences: Register Allocation, ISA, Instruction Selection, Instruction Scheduling, Functional Units, Register File

Multicores, common properties:
• Multiple flows of control
• Multiple local memories
Multicores, differences: Number and capabilities of cores, Communication Model, Synchronization Model

von-Neumann languages represent the common properties and abstract away the differences.
Need common machine language(s) for multicores.
Streaming as a Common Machine Language
• Regular and repeating computation
• Independent filters with explicit communication
  – Segregated address spaces and multiple program counters
  – Producer / consumer dependencies
  – Enables powerful, whole-program transformations
• Natural expression of parallelism

[Example stream graph: AtoD → FMDemod → Scatter → parallel low-pass and high-pass filter branches (LPF1-3, HPF1-3) → Gather → Adder → Speaker.]
Types of Parallelism
Task Parallelism
  – Parallelism explicit in algorithm
  – Between filters without producer/consumer relationship
Data Parallelism
  – Peel iterations of filter, place within scatter/gather pair (fission)
  – Can't parallelize filters with state
Pipeline Parallelism
  – Between producers and consumers
  – Stateful filters can be parallelized
Types of Parallelism
Task Parallelism
  – Parallelism explicit in algorithm
  – Between filters without producer/consumer relationship
Data Parallelism
  – Between iterations of a stateless filter
  – Place within scatter/gather pair (fission)
  – Can't parallelize filters with state
Pipeline Parallelism
  – Between producers and consumers
  – Stateful filters can be parallelized

[Diagram: a scatter/gather stream graph with regions labeled Task (independent parallel branches), Data (fissed copies of one filter), and Pipeline (producer-to-consumer chains).]
Types of Parallelism
Traditionally:
Task Parallelism
  – Thread (fork/join) parallelism
Data Parallelism
  – Data parallel loop (forall)
Pipeline Parallelism
  – Usually exploited in hardware
Problem Statement
Given:
  – Stream graph with compute and communication estimate for each filter
  – Computation and communication resources of the target machine
Find:
  – Schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources
Our 3-Phase Solution
Coarsen Granularity → Data Parallelize → Software Pipeline
1. Coarsen: fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters
Compile to a 16 core architecture:
  – 11.2x mean throughput speedup over single core
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
The StreamIt Project
• Applications
  – DES and Serpent [PLDI 05]
  – MPEG-2 [IPDPS 06]
  – SAR, DSP benchmarks, JPEG, …
• Programmability
  – StreamIt Language (CC 02)
  – Teleport Messaging (PPOPP 05)
  – Programming Environment in Eclipse (P-PHEC 05)
• Domain Specific Optimizations
  – Linear Analysis and Optimization (PLDI 03)
  – Optimizations for bit streaming (PLDI 05)
  – Linear State Space Analysis (CASES 05)
• Architecture Specific Optimizations
  – Compiling for Communication-Exposed Architectures (ASPLOS 02)
  – Phased Scheduling (LCTES 03)
  – Cache Aware Optimization (LCTES 05)
  – Load-Balanced Rendering (Graphics Hardware 05)

[Compiler flow: StreamIt Program → Front-end → Annotated Java → Stream-Aware Optimizations (with a Java-library simulator) → backends: Uniprocessor backend (C/C++), Cluster backend (MPI-like C/C++), Raw backend (C per tile + msg code), IBM X10 backend (Streaming X10 runtime).]
Model of Computation
• Synchronous Dataflow [Lee '92]
  – Graph of autonomous filters
  – Communicate via FIFO channels
• Static I/O rates
  – Compiler decides on an order of execution (schedule)
  – Static estimation of computation

[Example graph: A/D → Band Pass → Duplicate splitter → four Detect filters → four LEDs.]
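Because I/O rates are static, the compiler can solve balance equations to decide how many times each filter fires per steady-state iteration. A minimal sketch of that calculation for a straight pipeline, using hypothetical rates (illustration only, not the StreamIt compiler's code):

from fractions import Fraction
from math import lcm

def steady_state_reps(filters):
    # filters: list of (name, pop, push) for a straight pipeline.
    # Balance: reps[producer] * push == reps[consumer] * pop on every channel,
    # so each FIFO returns to its initial occupancy once per steady state.
    reps = {filters[0][0]: Fraction(1)}
    for (pname, _, push), (cname, pop, _) in zip(filters, filters[1:]):
        reps[cname] = reps[pname] * push / pop
    scale = lcm(*(r.denominator for r in reps.values()))
    return {name: int(r * scale) for name, r in reps.items()}

# Hypothetical rates: A pushes 2 per firing, B pops 3 and pushes 1, C pops 2.
print(steady_state_reps([("A", 0, 2), ("B", 3, 1), ("C", 2, 0)]))
# -> {'A': 3, 'B': 2, 'C': 1}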
Example StreamIt Filter (stateless)

float→float filter FIR (int N, float[N] weights) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] ∗ peek(i);
    }
    pop();
    push(result);
  }
}

[Diagram: the FIR filter peeking at a window of the input tape (items 0 through 11) and pushing one item onto the output tape.]
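To make the peek/pop/push semantics concrete, here is a small Python model of one firing of this work function (an illustration only, with the tapes modeled as plain lists; not StreamIt's runtime):

def fir_fire(input_tape, output_tape, weights):
    # One firing: peek N items, pop 1, push 1. Besides the tapes it reads only
    # `weights`, which it never modifies, so every firing is independent and
    # the filter can be data-parallelized.
    N = len(weights)
    result = sum(weights[i] * input_tape[i] for i in range(N))  # peek(i)
    input_tape.pop(0)             # pop()
    output_tape.append(result)    # push(result)

ins, outs = [0.0, 1.0, 2.0, 3.0], []
fir_fire(ins, outs, weights=[0.5, 0.25])   # hypothetical 2-tap weights
print(outs)   # [0.25]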
Example StreamIt Filter (stateful)

float→float filter FIR (int N) {
  float[N] weights;
  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);
    for (int i = 0; i < N; i++) {
      result += weights[i] ∗ peek(i);
    }
    pop();
    push(result);
  }
}

Here weights is a filter field that is rewritten on every execution (via adaptChannel), so this version is stateful.
StreamIt Language Overview
• StreamIt is a novel language for streaming
  – Exposes parallelism and communication
  – Architecture independent
  – Modular and composable: simple structures composed to create complex graphs
  – Malleable: change program behavior with small modifications
• Hierarchical constructs: filter, pipeline, splitjoin (splitter → parallel computation → joiner), and feedback loop; any position in a pipeline may be any StreamIt language construct

[Diagram: a pipeline of filters; a splitjoin with splitter, parallel computation, and joiner; a feedback loop with joiner and splitter.]
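One way to picture the hierarchy is as a small object model that a compiler might build; a sketch with illustrative names (not the actual StreamIt compiler IR):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Filter:                      # leaf node: one work function
    name: str

@dataclass
class Pipeline:                    # children run in sequence (producer -> consumer)
    children: List[object] = field(default_factory=list)

@dataclass
class SplitJoin:                   # children run in parallel between splitter and joiner
    children: List[object] = field(default_factory=list)

# Any construct can appear wherever a stream is expected, so small pieces
# compose into large graphs (loosely mirroring the FMDemod example above):
graph = Pipeline([
    Filter("AtoD"),
    Filter("FMDemod"),
    SplitJoin([Pipeline([Filter("LPF1"), Filter("HPF1")]),
               Pipeline([Filter("LPF2"), Filter("HPF2")]),
               Pipeline([Filter("LPF3"), Filter("HPF3")])]),
    Filter("Adder"),
    Filter("Speaker"),
])
print(len(graph.children))   # 5 top-level stages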
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
Baseline 1: Task Parallelism
• Inherent task parallelism between two processing pipelines
• Task Parallel Model:
  – Only parallelize explicit task parallelism
  – Fork/join parallelism
• Execute this on a 2 core machine: ~2x speedup over single core
• What about 4, 16, 1024, … cores?

[Graph: a splitjoin whose two branches are identical pipelines (BandPass → Compress → Process → Expand → BandStop), joined and fed into an Adder.]
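The task-parallel baseline amounts to fork/join over the independent branches of a splitjoin. A hedged sketch of that execution model (Python threads stand in for cores; this illustrates the model, not the Raw backend):

from concurrent.futures import ThreadPoolExecutor

def run_branch(filters, data):
    # run one processing pipeline (e.g. BandPass -> ... -> BandStop) sequentially
    for f in filters:
        data = f(data)
    return data

def run_splitjoin(branches, data, cores=2):
    # fork: each independent branch gets its own worker; join: gather the results
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(lambda b: run_branch(b, data), branches))

# Two independent pipelines give roughly 2x on 2 cores, but no additional
# parallelism is exposed no matter how many cores are available.
double = lambda xs: [2 * v for v in xs]
inc    = lambda xs: [v + 1 for v in xs]
print(run_splitjoin([[double, inc], [inc, double]], [1, 2, 3]))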
Evaluation: Task Parallelism
Raw microprocessor: 16 in-order, single-issue cores with D$ and I$; 16 memory banks, each bank with DMA; cycle accurate simulator.
Parallelism: not matched to target! Synchronization: not matched to target!

[Bar chart: throughput normalized to single-core StreamIt (y-axis 0 to 19) across the benchmarks BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean.]
Baseline 2: Fine-Grained Data Parallelism
• Each of the filters in the example is stateless
• We can introduce data parallelism
  – Example: 4 cores
• Fine-grained Data Parallel Model:
  – Fiss each stateless filter N ways (N is the number of cores)
  – Remove scatter/gather if possible
• Each fission group occupies the entire machine

[Graph: every filter of the two pipelines (BandPass, Compress, Process, Expand, BandStop) and the Adder replicated four ways inside its own splitter/joiner pair.]
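Fission of a stateless filter can be sketched as a graph rewrite: duplicate the filter N ways and wrap the copies in a scatter/gather pair, each copy taking a round-robin share of the firings. A simplified illustration (assumes a toy dictionary representation, not the real compiler IR):

def fiss(stage, n):
    # Replace one stateless stage with n copies behind a scatter/gather pair.
    # Each copy performs 1/n of the firings, so its work per steady state
    # drops by a factor of n.
    if stage["stateful"]:
        raise ValueError("cannot fiss a stateful filter")
    copies = [{**stage, "name": f'{stage["name"]}_{i}',
               "work": stage["work"] / n} for i in range(n)]
    return ({"name": "scatter", "type": "roundrobin"},
            copies,
            {"name": "gather", "type": "roundrobin"})

bandpass = {"name": "BandPass", "work": 8, "stateful": False}
scatter, copies, gather = fiss(bandpass, 4)      # occupy a 4-core machine
print([c["name"] for c in copies], [c["work"] for c in copies])
# ['BandPass_0', 'BandPass_1', 'BandPass_2', 'BandPass_3'] [2.0, 2.0, 2.0, 2.0]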
Evaluation: Fine-Grained Data Parallelism
Good parallelism! Too much synchronization!

[Bar chart: throughput normalized to single-core StreamIt (0 to 19) comparing Task and Fine-Grained Data parallelism across BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean.]
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
Phase 1: Coarsen the Stream Graph
• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream

[Graph: the two-pipeline example with the BandPass and BandStop filters marked "Peek".]
Phase 1: Coarsen the Stream Graph
• Before data-parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream
• Benefits:
  – Reduces global communication and synchronization
  – Exposes inter-node optimization opportunities

[Graph after coarsening: each branch becomes one fused filter (BandPass Compress Process Expand) followed by a separate BandStop; the Adder remains separate. See the fusion sketch below.]
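The fusion rules above can be sketched as a left-to-right scan over each pipeline that keeps fusing neighbours while the result stays stateless (a simplified illustration using a flat list of filters with 'stateful' and 'peeks' flags; not the actual StreamIt compiler pass):

def coarsen(pipeline):
    # Greedily fuse adjacent stateless filters into coarser filters.
    # Rules from the slide: never fuse stateless with stateful, and never fuse
    # a peeking filter with anything upstream (its peek window would become state).
    fused = []
    for f in pipeline:
        can_fuse = (fused
                    and not f["stateful"] and not fused[-1]["stateful"]
                    and not f["peeks"])
        if can_fuse:
            prev = fused.pop()
            fused.append({"name": prev["name"] + "+" + f["name"],
                          "work": prev["work"] + f["work"],
                          "stateful": False, "peeks": prev["peeks"]})
        else:
            fused.append(dict(f))
    return fused

mk = lambda n, w, st=False, pk=False: {"name": n, "work": w, "stateful": st, "peeks": pk}
branch = [mk("BandPass", 4, pk=True), mk("Compress", 1), mk("Process", 1),
          mk("Expand", 1), mk("BandStop", 4, pk=True)]
print([f["name"] for f in coarsen(branch)])
# ['BandPass+Compress+Process+Expand', 'BandStop']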
Phase 2: Data Parallelize
Data parallelize for 4 cores: fiss the Adder 4 ways, to occupy the entire chip.

[Graph: the fused branches unchanged; the Adder replaced by a splitter, four Adder copies, and a joiner.]
Phase 2: Data Parallelize
Data parallelize for 4 cores. Task parallelism! Each fused filter does equal work; fiss each filter 2 times to occupy the entire chip.

[Graph: each fused branch (BandPass Compress Process Expand) fissed 2 ways inside its own splitter/joiner pair; the BandStop filters still unsplit; the Adder fissed 4 ways.]
Phase 2: Data Parallelize
Data parallelize for 4 cores. Task parallelism: each filter does equal work; fiss each filter 2 times to occupy the entire chip.
• Task-conscious data parallelization
  – Preserve task parallelism
• Benefits:
  – Reduces global communication and synchronization

[Graph: the fused filters and the BandStop filters each fissed 2 ways; the Adder fissed 4 ways. See the fission-factor sketch below.]
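Task-conscious data parallelism fisses each coarsened filter only enough to fill the machine once its existing task-parallel copies are accounted for; a hedged sketch of that calculation:

from math import ceil

def fission_factor(task_parallel_copies, cores):
    # How many ways to fiss each stateless filter so that its task-parallel
    # copies together occupy the whole chip (slide example: 2 copies, 4 cores -> 2).
    return max(1, ceil(cores / task_parallel_copies))

print(fission_factor(task_parallel_copies=2, cores=4))   # 2: fiss each fused branch filter 2 ways
print(fission_factor(task_parallel_copies=1, cores=4))   # 4: a lone filter is fissed 4 ways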
Evaluation: Coarse-Grained Data Parallelism
Good parallelism! Low synchronization!

[Bar chart: throughput normalized to single-core StreamIt (0 to 19) comparing Task, Fine-Grained Data, and Coarse-Grained Task + Data across BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean.]
Simplified Vocoder
Target a 4 core machine.

[Stream graph with static work estimates: a splitter feeding two AdaptDFT filters (6 units each), joined into RectPolar (20 units, data parallel), then a splitjoin of two chains UnWrap (2) → Diff (1) → Amplify (1) → Accum (1) (data parallel, but too little work!), joined into PolarRect (20 units, data parallel).]
Data Parallelize
Target a 4 core machine.

[Graph after fission: RectPolar and PolarRect are each fissed 4 ways (work 20 becomes 5 per copy); the AdaptDFT filters and the low-work UnWrap/Diff/Amplify/Accum chains are left alone.]
Data + Task Parallel Execution
Target 4 core machine.

[Schedule diagram (cores versus time): with global communication between stages, the 4 cores execute the AdaptDFT stage (6), the fissed RectPolar stage (5), the UnWrap/Diff/Amplify/Accum chains (5), and the fissed PolarRect stage (5) one after another, for a period of 21 time units.]
We Can Do Better!
Target 4 core machine.

[Schedule diagram (cores versus time): the same graph and work estimates; a schedule that keeps all 4 cores busy needs only 16 time units, as the arithmetic sketch below suggests.]
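One hedged way to read the 21 versus 16 time units on the two schedules above, assuming the per-filter work estimates from the Simplified Vocoder slide:

# Per-stage critical paths read off the vocoder graph (relative work units):
stage_path = {"AdaptDFT": 6, "RectPolar fissed": 5, "UnWrap..Accum chain": 5, "PolarRect fissed": 5}

# Data + task parallelism alone: stages run one after another with global
# synchronization, so the period is the sum of the per-stage critical paths.
barrier_period = sum(stage_path.values())       # 6 + 5 + 5 + 5 = 21

# If stages from different graph iterations could run concurrently, the bound
# would instead be the total work spread over the 4 cores.
total_work = 2*6 + 4*5 + 2*5 + 4*5              # 2 AdaptDFT, 4 RectPolar, 2 chains, 4 PolarRect = 62
pipelined_period = -(-total_work // 4)          # ceil(62 / 4) = 16
print(barrier_period, pipelined_period)         # 21 16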
Phase 3: Coarse-Grained Software Pipelining
• After a prologue schedule executes, the graph reaches a new steady state
• The new steady state is free of dependencies
• Schedule the new steady state using a greedy partitioning

[Diagram: successive iterations of the stream graph (RectPolar highlighted) staggered in time, with the prologue at the top and the repeating new steady state below.]
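A minimal sketch of the idea, borrowed from software pipelining of loops: stage s of graph iteration i runs in global step i + s, so after a short prologue every step contains firings from different iterations and carries no producer/consumer dependences within the step (stage names reused from the vocoder example; this is not the compiler's actual scheduler):

def software_pipeline(stages, iterations):
    # Returns step -> list of firings; step i + s holds stage s of iteration i.
    steps = {}
    for i in range(iterations):
        for s, name in enumerate(stages):
            steps.setdefault(i + s, []).append(f"{name}[iter {i}]")
    return steps

for step, work in sorted(software_pipeline(["AdaptDFT", "RectPolar", "PolarRect"], 4).items()):
    print(step, work)
# Steps 0 and 1 form the prologue; from step 2 onward every stage is active in
# each step, each on a different iteration, so a partitioner may place the
# firings of one step on the cores in any way it likes.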
Greedy Partitioning
Target 4 core machine.

[Schedule diagram (cores versus time): the filters of the software-pipelined steady state are packed greedily onto the 4 cores, filling them for a period of 16 time units.]
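The slides say the new steady state is scheduled with a greedy partitioning; a common instance, assumed here for illustration (not necessarily the exact heuristic in the StreamIt compiler), is longest-processing-time-first bin packing onto the least-loaded core:

import heapq

def greedy_partition(filters, cores):
    # Assign (name, work) filters to cores: biggest filter first, always onto
    # the currently least-loaded core. Returns (load, core, assigned filters).
    heap = [(0, c, []) for c in range(cores)]          # (load, core id, filters)
    for name, work in sorted(filters, key=lambda f: -f[1]):
        load, c, assigned = heapq.heappop(heap)
        heapq.heappush(heap, (load + work, c, assigned + [name]))
    return sorted(heap, key=lambda t: t[1])

# Hypothetical steady-state work estimates from the vocoder example:
work = [("AdaptDFT_a", 6), ("AdaptDFT_b", 6), ("RectPolar_0", 5), ("RectPolar_1", 5),
        ("RectPolar_2", 5), ("RectPolar_3", 5), ("chain_a", 5), ("chain_b", 5),
        ("PolarRect_0", 5), ("PolarRect_1", 5), ("PolarRect_2", 5), ("PolarRect_3", 5)]
for load, core, assigned in greedy_partition(work, 4):
    print(core, load, assigned)
# Maximum core load is 16, matching the 16-unit period on the slide.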
Evaluation: Coarse-Grained Task + Data + Software Pipelining
Best parallelism! Lowest synchronization!

[Bar chart: throughput normalized to single-core StreamIt (0 to 19) comparing Task, Fine-Grained Data, Coarse-Grained Task + Data, and Coarse-Grained Task + Data + Software Pipeline across BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean.]
Generalizing to Other Multicores
• Architectural requirements:
  – Compiler controlled local memories with DMA
  – Efficient implementation of scatter/gather
• To port to other architectures, consider:
  – Local memory capacities
  – Communication to computation tradeoff
• Did not use processor-to-processor communication on Raw
Related Work
• Streaming languages:
  – Brook [Buck et al. '04]
  – StreamC/KernelC [Kapasi '03, Das et al. '06]
  – Cg [Mark et al. '03]
  – SPUR [Zhang et al. '05]
• Streaming for multicores:
  – Brook [Liao et al. '06]
• Ptolemy [Lee '95]
• Explicit parallelism:
  – OpenMP, MPI, & HPF
Conclusions
• Streaming model naturally exposes task, data, and pipeline parallelism
• This parallelism must be exploited at the correct granularity and combined correctly:
  – Task: parallelism not matched, synchronization not matched
  – Fine-Grained Data: parallelism good, synchronization high
  – Coarse-Grained Task + Data: parallelism good, synchronization low
  – Coarse-Grained Task + Data + Software Pipeline: parallelism best, synchronization lowest
• Good speedups across varied benchmark suite
• Algorithms should be applicable across multicores


Editor's notes

  1. Let's look at a simplified graph showing, on a log scale, the number of cores for some commodity microprocessors versus their date of introduction. For 35 years, uniprocessor designs dominated the commodity microprocessor market, but their end has come due to the limited scalability of their global, monolithic structures. The major CPU vendors have stopped development of uniprocessor designs. In the last 5 years, commodity processor designers have turned to multicore designs, from 2 cores up to hundreds of cores, to continue performance scaling.
  2. During the age of the uniprocessor, programmers had it pretty easy. They could write a piece of code in C or another von Neumann language and have it be portable and scalable across generations of uniprocessor designs. Furthermore, von Neumann languages had other great benefits; for example, programs were debuggable. We could say that C and the von Neumann languages were the common machine languages for uniprocessors: they encapsulated the common properties but abstracted away the differences across uniprocessor architectures.
  3. Now that multicore designs are becoming dominant, we want to ask ourselves what the common machine language for them is. We would like to write a program once and have it be portable and scale with future generations of multicore designs. We also want parallel programming to become as easy as sequential programming; thus the common machine language should be composable, malleable, and maintainable, and the mapping burden should fall squarely on the compiler.
  4. Also, a common machine language must not be too complex, so that it can attract typical programmers. Memories may be shared; it does not matter whether that is exposed in the language. If we were to use a von Neumann language as the common machine language for a multicore, it would require heroic efforts from the compiler. There are many research proposals for such a common machine language; one representation that covers a large part of the application space is a streaming representation, and our research proposes streaming languages as the common machine language across multicore architectures. Specifically, in this talk we develop compiler algorithms that break the abstraction layer between the language and the number and capability of the cores, enabling portability and high performance across multicores starting from a high-level stream language.
  5. Streaming applications are becoming increasingly prevalent in general-purpose processing and are already a large part of desktop and embedded workloads. Streaming programs are composable, malleable, and debuggable (because execution is deterministic).
  6. There is an abundance of parallelism in streaming programs. Because filters execute repeatedly, mapping them to different cores lets us take advantage of pipeline parallelism along chains of producers and consumers. For data parallelism, each of the duplicated copies executes fewer times than the original; the copies are placed behind a round-robin splitter. A filter can be data-parallelized if it is stateless, meaning that it does not write to any variable that is read by a later iteration. A nice property is that each type of parallelism can be expressed naturally in the stream graph.
  7. There is an abundance of parallelism in streaming programs. Some streaming representations require that all filters be data parallel; we do not have this requirement, and the compiler discovers the data-parallel filters. Because filters execute repeatedly, mapping them to different cores exposes pipeline parallelism along producer/consumer chains. Each duplicated copy of a fissed filter executes fewer times than the original and is placed behind a round-robin splitter. A filter can be data-parallelized if it is stateless, meaning that it does not write to any variable that is read by a later iteration. Each type of parallelism can be expressed naturally in the stream graph.
  8. There is no dynamic filter migration: each filter is mapped to a single core. We are concerned with throughput!
  9. First, coarsen the granularity to reduce communication overhead: this decreases global communication and synchronization and enables optimizations across the filters that are fused together. Next, data-parallelize the stateless components. Then, to exploit pipeline parallelism, software-pipeline the remaining components: after a prologue schedule has executed, the filters can be statically scheduled in any order in the steady state. We employ a greedy heuristic to schedule the software-pipelined steady state.
  10. Over the last five years we have been developing…
  11. There are many possible legal schedules. The example is frequency-band detection, as used in garage door openers and metal detectors. Highlight "filters"; replace with the filterbank. Every language has an underlying model of computation; StreamIt adopts synchronous dataflow (SDF). Programs are graphs: nodes are filters (i.e., kernels), which represent autonomous, standalone computation; edges represent the data flow between kernels. The compiler orchestrates the execution of the kernels: this is the schedule. As the earlier example showed, the schedule can affect locality, i.e., how often a kernel gets loaded and reloaded into the cache. A lot of previous work emphasized minimizing the data requirements between filters, but as we will show, in the presence of caches this is not a good strategy.
  12. Start off with the work function, the atomic unit of execution; it is repeatedly executed by the filter. Emphasize peek! Distinguish stateful versus stateless filters: stateless filters can be data-parallelized. Talk through the filter example: what computation looks like in StreamIt. Highlight the peek/pop/push rates and the work function. The filter is parameterizable: it takes an argument N. Emphasize the code! We can detect stateless filters using a simple program analysis.
  13. (Build of the same slide; same notes as slide 12.) A minimal sketch of such a filter follows these notes.
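As a concrete illustration of a work function with declared peek/pop/push rates, here is a minimal StreamIt-style sketch of a stateless, peeking filter; the FIR computation and its placeholder coefficients are assumptions for illustration, not the exact example from the slide.

    // Stateless, peeking filter: it reads a sliding window of N items but writes
    // no state that a later work execution reads, so it can be data-parallelized.
    float->float filter FIR(int N) {
      float[N] weights;

      init {
        for (int i = 0; i < N; i++)
          weights[i] = 1.0 / N;        // placeholder coefficients
      }

      // Rates are declared on the work function: peek N, pop 1, push 1.
      work peek N pop 1 push 1 {
        float sum = 0;
        for (int i = 0; i < N; i++)
          sum += weights[i] * peek(i); // peek(i) reads without consuming
        push(sum);
        pop();                         // consume one input item
      }
    }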
  14. Mention that a splitter can be duplicate or round-robin and that the joiner is round-robin. The streams of a pipeline or splitjoin do not all have to be the same; graphs are parameterized. The StreamIt language is designed with productivity in mind: it is natural for the programmer to represent the computation, and it exposes what is traditionally hard for the compiler to discover, namely parallelism and communication. The language is also designed to be modular and composable, which matters for software engineering and for correctness checking. Show the stream constructs: a filter is the basic unit of computation; a pipeline is the sequential (serial) composition of streams that communicate data; a stream can be a filter or any of the language's stream constructs, which gives modularity and reusability; a splitjoin expresses explicit parallelism, scattering data with a splitter and gathering it with a joiner; a feedback loop allows cycles in the graph. The stream constructs are parameterizable (the length of a pipeline, the width of a splitjoin), which gives rise to malleability: a small change in the code leads to big changes in the program graph. Example on the next slide.
  15. Each pipeline filters a different frequency band. A minimal sketch of such a graph, built from the constructs above, follows these notes.
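Here is a minimal StreamIt-style sketch of a filter bank built as a parameterized splitjoin whose branches are pipelines; the LowPass/HighPass filters, their cutoffs, and the anonymous nested pipeline are assumptions for illustration, not the benchmark's actual code.

    // Placeholder band-limiting filters; real DSP bodies are omitted.
    float->float filter LowPass(float cutoff) {
      work pop 1 push 1 { push(pop()); }   // placeholder: a real low-pass would filter here
    }
    float->float filter HighPass(float cutoff) {
      work pop 1 push 1 { push(pop()); }   // placeholder: a real high-pass would filter here
    }

    // The bank itself: a parameterized splitjoin whose width is set by 'bands'.
    float->float splitjoin FilterBank(int bands) {
      split duplicate;                     // every branch sees the full input signal
      for (int i = 0; i < bands; i++)
        add pipeline {                     // anonymous nested pipeline for band i
          add LowPass((i + 1) * 1000.0);
          add HighPass(i * 1000.0);
        };
      join roundrobin;                     // interleave the per-band outputs
    }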
  16. Scale the top to 16 and use the same scale (height, etc.) for each. Mention the problems with task parallelism: it is not adequate as the only source of parallelism. Highlight bitonic sort and mention the problems with communication granularity. How large is the filterbank? Does the number match?
  17. Where direct communication is possible we remove the scatter/gather, but we still need scatter/gather between data-parallel filters with non-matching I/O rates.
  18. Think of something to say about the filterbank! Either make the legend bigger or say what the legend is.
  19. A filter that peeks performs its computation on a sliding window, so items need to be preserved across invocations of the filter.
  20. Define fusion; it is akin to inlining. A minimal sketch follows these notes.
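To show what fusion does, here is a minimal StreamIt-style sketch using two hypothetical stateless filters; the compiler performs this transformation automatically, so the hand-written "after" version is only an illustration of the result.

    // Before fusion: a two-stage pipeline of simple stateless filters.
    float->float filter Scale {
      work pop 1 push 1 { push(pop() * 2.0); }
    }
    float->float filter Offset {
      work pop 1 push 1 { push(pop() + 1.0); }
    }
    float->float pipeline ScaleThenOffset {
      add Scale();
      add Offset();
    }

    // After fusion (conceptually): one filter whose work function inlines both
    // stages, eliminating the intermediate channel between them.
    float->float filter FusedScaleOffset {
      work pop 1 push 1 {
        float t = pop() * 2.0;   // former Scale stage
        push(t + 1.0);           // former Offset stage
      }
    }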
  21. Each pipeline filters a different frequency. Remember that StreamIt is hierarchical, so a pipeline element can be a nested splitjoin, a pipeline, or a filter.
  22. Go into more detail on the benefits: we naturally take advantage of task parallelism and avoid the added synchronization imposed by filter fission.
  23. There is a significant amount of state… A minimal sketch of a stateful filter follows these notes.
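To make "state" concrete, here is a minimal StreamIt-style sketch of a stateful filter (the running-sum computation is a hypothetical example): the field is written in one work execution and read in the next, so the filter cannot be fissed, though software pipelining can still overlap it with other filters.

    // Stateful filter: 'sum' carries a value from one work execution to the next.
    float->float filter Accumulator {
      float sum;

      init {
        sum = 0;
      }

      work pop 1 push 1 {
        sum += pop();   // this write is read by the next iteration -> stateful
        push(sum);
      }
    }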
  24. The graph is annotated with relative load estimates for one execution of the stream graph. We do not want to data-parallelize the amplify filters because they perform so little work per steady state that the scatter and gather would cost more than the parallelized computation. We do not coarsen the stateful components because we would like the scheduler to have as much freedom as possible in scheduling small tasks.
  25. Color-code the graph for easier reference. We can map this graph to a multicore, following the structure of the graph and taking advantage of data and task parallelism; each execution of the graph would then require 21 time units.
  26. But we can do better.
  27. Because we are executing the stream graph repeatedly, we can unroll successive iterations…
  28. Compare to 9.5.
  29. Put in all the bars for this; grey out the other bars (both the outline and the fill). Explain the Vocoder and the Radar applications and why they do so well. Redo the colors! Explain the other minor speedups. Explain the MPEG-2 decoder and comment on state: is it going to become more important with MPEG-4 and H.264?
  30. Mention that our previous work utilized fine-grained processor-to-processor communication for hardware pipelining
  31. Brook exposes only data parallelism: actors are required to be stateless (with support for reductions), the focus is on ILP and data parallelism, and producer/consumer relationships are exploited only for memory-hierarchy optimizations. Das's recent PACT paper describes scheduling techniques for Brook targeting Imagine; traditional loop-scheduling techniques are leveraged, but only to overlap memory accesses with a single compute kernel, hiding memory latencies between the stream register file and system memory and preventing spills from the stream register file. We apply traditional loop-scheduling techniques to parallelize stateful compute components (filters), and we target a multicore architecture; we exploit data parallelism at a coarser granularity, across cores, and we fuse compute components to match the granularity of the architecture. Cg offers pipeline and data parallelism, but only for the two pipeline stages of a graphics processor; StreamIt places more emphasis on exploiting task and pipeline parallelism. These languages make it difficult to coarsen the granularity and to software-pipeline. MapReduce is a robust framework for data parallelism that focuses on reductions (which we have not focused on); half of our benchmarks could be parallelized using MapReduce. Stress the comparison to explicitly parallel languages: we get great speedups without programmer intervention, and the program is written in a portable manner that allows the compiler to produce efficient code. Ptolemy is an environment for simulation, prototyping, and software synthesis for heterogeneous systems; it includes an SDF component, but the system focuses on simulation and modeling rather than on code generation for actual architectures. Intel and AMD are pushing OpenMP and MPI to program their multicores; these systems graft parallel constructs onto C and Fortran. They are not composable and are hard to debug, and they place the parallelization burden on the programmer, who is forced to make granularity, load-balancing, locality, and synchronization decisions by profiling the code. We move these decisions into the compiler, lowering the bar for parallel programming while achieving portability and high performance.
  32. Task parallelism alone is inadequate because the parallelism and synchronization are not matched to the target, forcing the programmer to intervene and create unportable code. Fine-grained data parallelism offers good parallelism but would overwhelm the communication mechanism of a multicore. Coarsening the granularity before data parallelism is exploited achieves great parallelization of the stateless components. Finally, adding software pipelining lets us parallelize the stateful components and offers the best parallelism and the lowest synchronization, because of the further opportunities for coarsening. We are conscious of the multicore's communication substrate and do not want to overwhelm it. Our algorithms can remain largely unchanged across multicore architectures.