[214] Flexible and Extensible Big Data Processing

Onyx:
A Flexible and Extensible
Data Processing System
전병곤, 김주연, 송원욱
Software Platform Lab
Joint work with 양영석, 이산하, 서장호, 어정윤, 이계원, 엄태건, 이우연,
이윤성, 정주성, 하현민, 정은지, 김수정, 유경인, 신동진

Data Processing from 10,000 Feet
Data Processing Application
Data Processing Framework (Spark, Flink, Hadoop MR, Dryad, Tez, ...)
Resource Environment

Spark, Flink, Hadoop MR, Dryad, Tez, ...
Existing frameworks perform poorly in new resource environments (e.g., disaggregation, transient resources).

Disaggregation
[Figure: Compute and Storage (Ref. OpenCompute)]
Intermediate data generated on compute nodes must be written to and read from storage nodes.

Transient Resources
Preemption!
Task preemption can cause expensive recomputation.

Cross Datacenter
Wide-area network bandwidth is scarce and expensive

Spark, Flink, Hadoop MR, Dryad, Tez, ...
It is hard to add new application optimization features to existing frameworks.

Dynamic Optimization
Dynamic skew handling
Optimizing job execution based on its characteristics
Adapting execution to resource elasticity

Key Observation
Current data processing frameworks
are not flexible and extensible.
=> A new flexible and extensible data processing system

Onyx Architecture
Dataflow Program → Onyx Compiler → Onyx Runtime → Cluster

Onyx Compiler
Beam Program → Beam Frontend → IR DAG
Spark Program → Spark Frontend → IR DAG
IR DAG → Onyx Backend → Physical Execution Plan

IR (Intermediate Representation) DAG
: Program-agnostic DAG with Annotations
Vertex Labels
● Type: Operator/Loop
● Placement: GPUNode/ReservedNode/TransientNode/Any
● Parallelism
Edge Labels
● Type: 1:1/Broadcast/Shuffle
● Mode: Push/Pull
● Storage: Memory/Disk/RemoteDisk

MapReduce Example
● Classical MapReduce: Map → Reduce edge labeled Shuffle,Pull,Disk
● Small-scale MapReduce: Map → Reduce edge labeled Shuffle,Push,Memory
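The annotated IR DAG can be sketched in a few lines of Python. This is only an illustration of the idea on the slides, not Onyx's actual classes: the names `Vertex`, `Edge`, `IRDag`, `map_reduce`, and `describe` are hypothetical, and only the label names and values come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Vertex:
    name: str
    labels: Dict[str, object] = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class IRDag:
    vertices: List[Vertex] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)


def map_reduce(mode: str, storage: str) -> IRDag:
    """Build a Map -> Reduce DAG whose single Shuffle edge carries the given labels."""
    return IRDag(
        vertices=[
            Vertex("Map", {"Type": "Operator", "Placement": "Any"}),
            Vertex("Reduce", {"Type": "Operator", "Placement": "Any"}),
        ],
        edges=[Edge("Map", "Reduce",
                    {"Type": "Shuffle", "Mode": mode, "Storage": storage})],
    )


def describe(dag: IRDag) -> str:
    """Render the labels of the first edge in the slide notation."""
    labels = dag.edges[0].labels
    return ",".join([labels["Type"], labels["Mode"], labels["Storage"]])


# The same program, annotated in two different ways.
classical = map_reduce(mode="Pull", storage="Disk")
small_scale = map_reduce(mode="Push", storage="Memory")

print(describe(classical))    # Shuffle,Pull,Disk
print(describe(small_scale))  # Shuffle,Push,Memory
```

The point of the sketch is that the program (Map followed by Reduce) stays the same, and only the labels on the DAG change between the two execution styles.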

Compiler Passes
Transform an IR DAG into an optimized IR DAG through a series of “passes”
Compile-Time Annotation Pass examples:
● Parallelism Pass
● Executor Placement (EP) Pass
● Data Flow Model (DFM) Pass
● Stage Partitioning Pass
Variations:
● Transient Resource EP Pass
● Transient Resource DFM Pass
● Resource Disaggregation EP Pass
● Resource Disaggregation DFM Pass

Compiler Passes
Compile-Time Annotation Pass examples:
Common:
● Parallelism Pass
● Executor Placement Pass
● Data Flow Model Pass
● Stage Partitioning Pass
Specialized (variations):
● Transient Resource EP Pass
● Transient Resource DFM Pass
Specialized (variations):
● Resource Disaggregation EP Pass
● Resource Disaggregation DFM Pass

Compiler Passes
Compile-Time Reshaping Pass examples:
● Loop Extraction Pass
● Loop Fusion Pass (Loop Optimization)
● Common Subexpression Elimination Pass
● Data Skew Reshaping Pass
Runtime Pass example:
● Data Skew Runtime Pass

Compiler to Runtime
Optimized IR DAG
● Map Stage: Type: “Map” Operator, Placement: “Compute” Node, Parallelism: 100
● Edge: Shuffle,Pull,Disk
● Reduce Stage: Type: “Reduce” Operator, Parallelism: 50

Compiler to Runtime
Physical DAG
● PhysicalStage: “Map” Tasks x 100
● PhysicalStage: “Reduce” Tasks x 50
● I/O channels for intermediate data flow between tasks
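A rough sketch of this expansion, assuming that a Shuffle edge connects every upstream task to every downstream task. The helper names `expand_stage` and `shuffle_channels` are hypothetical and are not taken from the Onyx code base.

```python
from itertools import product


def expand_stage(name, parallelism):
    """Turn one logical stage into `parallelism` physical task names."""
    return [f"{name}-{i}" for i in range(parallelism)]


def shuffle_channels(src_tasks, dst_tasks):
    """A shuffle needs one I/O channel per (source task, destination task) pair."""
    return list(product(src_tasks, dst_tasks))


map_tasks = expand_stage("Map", 100)       # Parallelism: 100
reduce_tasks = expand_stage("Reduce", 50)  # Parallelism: 50
channels = shuffle_channels(map_tasks, reduce_tasks)

print(len(map_tasks), len(reduce_tasks), len(channels))  # 100 50 5000
```

Under this assumption the number of channels grows with the product of the two parallelism values, which is one reason the edge annotations (Push/Pull, Memory/Disk) matter at scale.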

Distributed Execution in Onyx Runtime
[Figure sequence: a Master holding a Stage made of TaskGroups (Tasks), alongside four Executors]

Onyx in Action
● Onyx compiler and runtime components
● Onyx job execution: MR, ALS
● Onyx runtime optimization: dynamic skew handling
● Harnessing transient resources with Onyx
Other optimizations are omitted due to time constraints!

MapReduce
● We will show two executions of MapReduce using different settings:
○ Intermediate data is stored on disk and pulled by the reducers
○ Intermediate data is stored in memory and pushed to the reducers
● To vary the settings, we apply the following passes:
○ A data store pass
○ A data flow model pass
● Both of these are “Annotation” passes
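The two settings can be sketched as a short series of annotation passes over the same IR DAG. This is a minimal illustration under assumptions: the DAG is a plain dictionary, and `data_store_pass`, `data_flow_model_pass`, and `apply_passes` are hypothetical names, not the Onyx API.

```python
from copy import deepcopy
from functools import reduce


def data_store_pass(storage):
    """Return a pass that sets the Storage label on every Shuffle edge."""
    def run(dag):
        out = deepcopy(dag)  # a pass returns a new DAG and leaves its input untouched
        for edge in out["edges"]:
            if edge["Type"] == "Shuffle":
                edge["Storage"] = storage
        return out
    return run


def data_flow_model_pass(mode):
    """Return a pass that sets the Mode label (Push or Pull) on every Shuffle edge."""
    def run(dag):
        out = deepcopy(dag)
        for edge in out["edges"]:
            if edge["Type"] == "Shuffle":
                edge["Mode"] = mode
        return out
    return run


def apply_passes(dag, passes):
    """Apply the passes one after another, feeding each result to the next pass."""
    return reduce(lambda current, p: p(current), passes, dag)


ir_dag = {
    "vertices": ["Map", "Reduce"],
    "edges": [{"src": "Map", "dst": "Reduce", "Type": "Shuffle"}],
}

disk_pull = apply_passes(ir_dag, [data_store_pass("Disk"), data_flow_model_pass("Pull")])
memory_push = apply_passes(ir_dag, [data_store_pass("Memory"), data_flow_model_pass("Push")])

print(disk_pull["edges"][0])
print(memory_push["edges"][0])
```

Because annotation passes only add labels, the vertices and edges of the DAG are identical in both results; only the labels differ.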

Demo
Map Data on Disk, Pulled
Map Stage → Reduce Stage (Shuffle,Pull,Disk)

Demo
Map Data in Memory, Pushed
Map Stage → Reduce Stage (Shuffle,Push,Memory)

Alternating Least Squares Example
● Alternating Least Squares (ALS) is an ML algorithm commonly used in recommendation systems.
● Most ML algorithms are iterative processes
=> ALS is one of them!
● But how is this expressed in terms of a DAG? (Acyclic!)

Naively…
(Read input data) → Iteration 1 → Iteration 2 → ... → Iteration N → (Write output)
Each iteration is a set of operators that executes the ALS algorithm.
But what if we want to decide this “N” according to some condition (e.g., model convergence in ML)?

Something special we have for the ALS example: Loops!
(Read input data) → LoopVertex with termination condition → (Write output)
The LoopVertex stands for: (Read input data) → Iteration 1 → Iteration 2 → ... → Iteration N → (Write output)
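As a rough sketch of how a loop with a termination condition can still yield an acyclic chain, the function below appends one iteration at a time until the condition holds. The function `unroll` and the toy loop body are hypothetical. The body simply halves an error value and stands in for one ALS iteration; it is not the ALS algorithm itself.

```python
def unroll(loop_body, state, terminate, max_iterations=100):
    """Run the loop body until `terminate(state)` holds, recording the acyclic chain."""
    chain = ["Read input data"]
    n = 0
    while n < max_iterations and not terminate(state):
        state = loop_body(state)
        n += 1
        chain.append(f"Iteration {n}")
    chain.append("Write output")
    return chain, state


# Toy stand-in for one training iteration: each pass halves the model error.
chain, final_error = unroll(
    loop_body=lambda error: error / 2,
    state=1.0,
    terminate=lambda error: error < 0.01,  # e.g. a convergence check
)

print(" -> ".join(chain))
print(final_error)  # 0.0078125
```

Here N is not fixed in the program. It falls out of the termination condition (7 iterations for this toy input), while the recorded chain remains a plain sequence with no cycles.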

Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
● Onyx Compiler applies AnnotationPass(es) and ReshapingPass(es) to the IR DAG
● The optimized IR DAG (Stage → Stage over a Shuffle,Pull,Disk edge) goes through Physical DAG Conversion
● Onyx Runtime receives the Physical DAG
● Onyx Runtime executes the Physical DAG
● While the Physical DAG is executing, a Data Size Metric is collected
● RuntimePass(es) produce a New DAG
● Onyx Runtime executes the New DAG
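One way a runtime pass could use the data size metric is to reassign contiguous hash-bucket ranges to reducers so that each reducer receives a similar amount of data. The sketch below is an assumption about the general technique, not the Onyx implementation. The names `rebalance` and `loads` and the bucket sizes are made up, and the function expects at least as many buckets as reducers.

```python
def rebalance(bucket_sizes, num_reducers):
    """Split hash buckets into contiguous ranges with roughly equal total size."""
    ranges, start = [], 0
    remaining = sum(bucket_sizes)
    for reducers_left in range(num_reducers, 1, -1):
        target = remaining / reducers_left
        # Leave at least one bucket for each reducer that still needs a range.
        max_end = len(bucket_sizes) - (reducers_left - 1)
        acc, end = 0, start
        # Take a bucket if it is the first one, or if it moves us closer to the target.
        while end < max_end and (end == start or acc + bucket_sizes[end] / 2 <= target):
            acc += bucket_sizes[end]
            end += 1
        ranges.append((start, end))
        remaining -= acc
        start = end
    ranges.append((start, len(bucket_sizes)))
    return ranges


def loads(bucket_sizes, ranges):
    """Total data size that each reducer would read for the given ranges."""
    return [sum(bucket_sizes[lo:hi]) for lo, hi in ranges]


# Hypothetical metric: buckets 4 and 5 are hot.
sizes = [5, 5, 5, 5, 40, 40, 5, 5, 5, 5, 5, 5]
even = [(0, 4), (4, 8), (8, 12)]  # static split, four buckets per reducer
skew_aware = rebalance(sizes, 3)

print(loads(sizes, even))        # [20, 90, 20]
print(skew_aware)                # [(0, 5), (5, 6), (6, 12)]
print(loads(sizes, skew_aware))  # [60, 40, 30]
```

In this toy input the largest reducer load drops from 90 to 60. A single hot bucket still cannot be split by this scheme, so finer-grained buckets give the pass more room to balance.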

Demo
Dynamic Data Partitioning

Harnessing Transient Resources with Onyx
Using the techniques introduced in “Pado: A Data Processing Engine for Harnessing Transient Resources in Datacenters” (EuroSys 2017)

Batch Engine
MapReduce, Flume, Spark, ...
Transient Resources?

Transient Resources
Resources borrowed from
over-provisioned latency-critical jobs
(search service, online mall, etc.)

Data Analytics with Transient Resources
[Figure sequence: a Dataflow Program is executed as Tasks on Transient resources, which then hold the resulting Data]

Solution
Analyze the Dataflow Program:
● Valuable Computations → Reserved
● Other Computations → Transient

Our definition of Valuable computations
Dependency types: One-to-One, One-to-Many, Many-to-One, Many-to-Many
[Figure: the four dependency types placed on a scale between Not so valuable and Valuable]

Map-Reduce with Transient Containers
(Case #1) Batch Engines (e.g., Spark)
(Case #2) Our Approach
Map → Reduce (Many-to-Many)

2 Transient, 1 Reserved Containers
Map → Reduce dependency analysis:
● Map: No dependency ⇒ Not so valuable ⇒ Transient
● Reduce: Many-to-Many ⇒ Valuable ⇒ Reserved

Batch Engines (e.g., Spark)
● Map and Reduce tasks on each container (Map1 Map2 Map3, Reduce1 Reduce2 Reduce3)
● Maintain Map Outputs on Local Disks
● Pull Map Outputs
● Eviction of Transient Containers → Map Outputs Destroyed
● Cascading Recomputation of 5 Tasks

Our Approach
● Map tasks on Transient (Map1 Map2) and Reduce task on Reserved (Reduce1)
● Push Map Outputs to Destination Reserved Containers
● Read Input Data from Local Reserved Containers
● Eviction of Transient Containers → Map Outputs Not Destroyed
● No Recomputation

Step 1: Transient/Reserved Executor Placement Pass

Operator Placement Example with the
Transient Resource Policy
Multinomial Logistic Regression (MLR): a machine learning application for classifying inputs, e.g., tumors as malignant or benign, or ad clicks as profitable or not.
Gradients are used to update the regression model, which is used for prediction.

Executor Placement Example
MLR dataflow: Read Training Data and Create 1st Model feed Compute Gradient, followed by Aggr Gradient, Compute 2nd Model, ....
Edge types: One-to-One, One-to-Many, Many-to-One (Costly!)
● Read Training Data, Create 1st Model: No Dependency ⇒ Transient
● Compute Gradient: No Costly Dependency with Parents ⇒ Transient
● Aggr Gradient: Costly Dependency with Parent ⇒ Reserved
● Compute 2nd Model: Costly Dependency with Parent, Pipelined ⇒ Reserved
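The placement rule suggested by these slides can be sketched as follows. Two things here are assumptions rather than facts from the slides: the exact edges of the MLR dataflow (reconstructed from the figure labels) and the rule that an operator fed only by One-to-One edges from Reserved parents is also Reserved. The function name `place` is hypothetical.

```python
COSTLY = {"Many-to-One", "Many-to-Many"}

# Assumed MLR dataflow, in topological order: (source, destination, dependency type).
VERTICES = ["Read Training Data", "Create 1st Model", "Compute Gradient",
            "Aggr Gradient", "Compute 2nd Model"]
EDGES = [
    ("Read Training Data", "Compute Gradient", "One-to-One"),
    ("Create 1st Model", "Compute Gradient", "One-to-Many"),
    ("Compute Gradient", "Aggr Gradient", "Many-to-One"),
    ("Aggr Gradient", "Compute 2nd Model", "One-to-One"),
]


def place(vertices, edges):
    """Assign each operator to Transient or Reserved executors."""
    placement = {}
    for v in vertices:  # parents are placed before children (topological order)
        incoming = [(src, kind) for src, dst, kind in edges if dst == v]
        if not incoming:
            placement[v] = "Transient"   # no dependency: cheap to recompute
        elif any(kind in COSTLY for _, kind in incoming):
            placement[v] = "Reserved"    # costly dependency with a parent
        elif all(kind == "One-to-One" and placement[src] == "Reserved"
                 for src, kind in incoming):
            placement[v] = "Reserved"    # pipelined behind a reserved parent
        else:
            placement[v] = "Transient"   # no costly dependency with parents
    return placement


for name, where in place(VERTICES, EDGES).items():
    print(f"{name}: {where}")
```

With these assumptions only Aggr Gradient and Compute 2nd Model land on Reserved executors, so an eviction never forces the costly many-to-one aggregation to be redone.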

Step 2: Data Flow Model Pass

[Figure sequence: the MLR dataflow placed on Reserved and Transient executors, with edges labeled Push or Pull]
● Recall: Reserved executors are safe! Transient executors are prone to evictions :(
● Must evacuate data out of transient executors ASAP
● Push data out as soon as it is ready!
● No need to hurry for data in Reserved containers: Pull
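A minimal sketch of this rule: an edge is annotated Push when its source operator runs on a Transient executor and Pull when it runs on a Reserved one. The MLR edges and the placement below are assumptions reconstructed from the slides, and `data_flow_model` is a hypothetical name.

```python
# Assumed placement of the MLR operators (the outcome of the placement step).
PLACEMENT = {
    "Read Training Data": "Transient",
    "Create 1st Model": "Transient",
    "Compute Gradient": "Transient",
    "Aggr Gradient": "Reserved",
    "Compute 2nd Model": "Reserved",
}

# Assumed MLR edges as (source, destination) pairs.
EDGES = [
    ("Read Training Data", "Compute Gradient"),
    ("Create 1st Model", "Compute Gradient"),
    ("Compute Gradient", "Aggr Gradient"),
    ("Aggr Gradient", "Compute 2nd Model"),
]


def data_flow_model(edges, placement):
    """Label each edge Push or Pull based on where its source operator runs."""
    modes = {}
    for src, dst in edges:
        # Data sitting on a transient executor can vanish on eviction: push it out early.
        modes[(src, dst)] = "Push" if placement[src] == "Transient" else "Pull"
    return modes


for (src, dst), mode in data_flow_model(EDGES, PLACEMENT).items():
    print(f"{src} -> {dst}: {mode}")
```

Only the edge leaving Aggr Gradient is pulled here, since its data already sits on a Reserved executor and is in no danger of eviction.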

Step 3: Stage Partitioning Pass

Stage Partitioning in Compiler
Execute subgraph-by-subgraph
⇒ Partition into subgraphs
⇒ Good abstraction for handling evictions/faults

Stage Partitioning Example
[Figure sequence: the MLR dataflow (Read Training Data, Create 1st Model, Compute Gradient, Aggr Gradient, Compute 2nd Model, ....) on Reserved and Transient executors is partitioned step by step into Stage 1, Stage 2, and Stage 3]
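One plausible partitioning rule, shown here only as a sketch, is to keep an operator in its parent's stage when the two are connected by a One-to-One edge and share the same placement, and to start a new stage otherwise. The slides do not spell out the rule, so this rule, the MLR edges, the placement, and the name `partition_stages` are all assumptions. The stage numbers produced need not match the numbering in the figure.

```python
# Assumed MLR dataflow in topological order.
VERTICES = ["Read Training Data", "Create 1st Model", "Compute Gradient",
            "Aggr Gradient", "Compute 2nd Model"]
EDGES = [
    ("Read Training Data", "Compute Gradient", "One-to-One"),
    ("Create 1st Model", "Compute Gradient", "One-to-Many"),
    ("Compute Gradient", "Aggr Gradient", "Many-to-One"),
    ("Aggr Gradient", "Compute 2nd Model", "One-to-One"),
]
PLACEMENT = {
    "Read Training Data": "Transient",
    "Create 1st Model": "Transient",
    "Compute Gradient": "Transient",
    "Aggr Gradient": "Reserved",
    "Compute 2nd Model": "Reserved",
}


def partition_stages(vertices, edges, placement):
    """Group operators into stages (subgraphs) that can be executed one by one."""
    stage_of = {}
    next_stage = 0
    for v in vertices:
        # Look for a parent that can be pipelined with v inside the same stage.
        parent_stage = next(
            (stage_of[src] for src, dst, kind in edges
             if dst == v and kind == "One-to-One" and placement[src] == placement[v]),
            None,
        )
        if parent_stage is None:
            next_stage += 1
            stage_of[v] = next_stage
        else:
            stage_of[v] = parent_stage
    return stage_of


stages = partition_stages(VERTICES, EDGES, PLACEMENT)
for name, stage in stages.items():
    print(f"Stage {stage}: {name}")
```

Under these assumptions the five operators fall into three stages, and an eviction or fault only requires re-running the affected stage rather than the whole job.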

Demo
Executor Placement Pass
Data Flow Model Pass
Stage Partitioning Pass
with the MLR example

Batch Engines
Spark 2.0.0 vs. Onyx with the suggested optimizations

Containers
● Amazon EC2 instances (with local SSDs) as containers
● 40 Transient Containers, 5 Reserved Containers
● All containers used for computation

Workloads
● Alternating Least Squares
Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta Information, v. 1.0. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
● Multinomial Logistic Regression
Synthetic data
● Map-Reduce
Page view statistics for Wikimedia projects. https://dumps.wikimedia.org/other/pagecounts-raw

Job Completion Time (Lower is Better)
[Figure: job completion time comparison, annotated with 4.13x, 3.52x, and 5.15x]

Summary
● Onyx is a new data processing system that is flexible and extensible
○ A compiler that represents various execution policies
○ A runtime that is modular and reconfigurable
● Onyx adapts data processing seamlessly to new deployment and application requirements

We are working on creating an Apache Incubator project. We look forward to contributions from many developers!
We are hiring software developers!
Contact: onyx@spl.snu.ac.kr
Software platform lab site: http://spl.snu.ac.kr
