7. Data Processing from 10,000 Feet
Data Processing Application
Data Processing Framework (Spark, Flink, Hadoop MR, Dryad, Tez, ...)
Resource Environment
It is hard to add new application optimization features
to existing frameworks.
8. Dynamic Optimization
Dynamic skew handling
Optimizing job execution based on its characteristics
Adapting execution to resource elasticity
9. Key Observation
Current data processing frameworks
are not flexible and extensible.
=> A new flexible and extensible data processing system
14. Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Annotation Pass examples:
● Parallelism Pass
● Executor Placement Pass
● Data Flow Model Pass
● Stage Partitioning Pass
Variations:
● Transient Resource EP (Executor Placement) Pass
● Transient Resource DFM (Data Flow Model) Pass
● Resource Disaggregation EP Pass
● Resource Disaggregation DFM Pass
15. Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Annotation Pass examples:
Common:
● Parallelism Pass
● Executor Placement Pass
● Data Flow Model Pass
● Stage Partitioning Pass
Specialized variations:
● Transient Resource EP Pass
● Transient Resource DFM Pass
● Resource Disaggregation EP Pass
● Resource Disaggregation DFM Pass
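The idea of running an IR DAG through a series of passes, and swapping a common pass for a specialized variation, can be sketched roughly as follows. This is a minimal illustration, not Onyx's actual API: the dictionary-based DAG, the function names, the default parallelism of 4, and the `hard_to_recompute` flag are all assumptions made for this example.

```python
def parallelism_pass(dag):
    """Common pass: annotate each vertex with a parallelism (assumed default of 4)."""
    return [{**v, "parallelism": v.get("parallelism", 4)} for v in dag]


def executor_placement_pass(dag):
    """Common pass: place every vertex on a generic compute node."""
    return [{**v, "placement": "Compute"} for v in dag]


def transient_resource_ep_pass(dag):
    """Specialized variation of executor placement.

    The rule used here (a hypothetical `hard_to_recompute` flag decides
    between reserved and transient containers) is an assumption.
    """
    return [
        {**v, "placement": "Reserved" if v.get("hard_to_recompute") else "Transient"}
        for v in dag
    ]


def run_passes(dag, passes):
    """Apply each pass in order; every pass returns a new annotated DAG."""
    for apply_pass in passes:
        dag = apply_pass(dag)
    return dag


# Two pass lists that differ only in which executor placement pass they use.
DEFAULT_POLICY = [parallelism_pass, executor_placement_pass]
TRANSIENT_POLICY = [parallelism_pass, transient_resource_ep_pass]

dag = [{"name": "Map"}, {"name": "Reduce", "hard_to_recompute": True}]
default_dag = run_passes(dag, DEFAULT_POLICY)
transient_dag = run_passes(dag, TRANSIENT_POLICY)
print([v["placement"] for v in default_dag])
print([v["placement"] for v in transient_dag])
```

Because each pass only adds annotations and returns a new DAG, a specialized pass can replace its common counterpart without touching the other passes in the list.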
16. Compiler Passes
Transform an IR DAG into an optimized IR DAG after a series of “passes”
Compile-Time Reshaping Pass examples:
● Loop Extraction Pass
● Loop Fusion Pass (Loop Optimization)
● Common Subexpression Elimination Pass
● Data Skew Reshaping Pass
Runtime Pass example:
● Data Skew Runtime Pass
17. Compiler Passes
(Repeat of slide 16, with a “Specialized” label on part of the pass list.)
18. Compiler to Runtime
Optimized IR DAG:
Map Stage
Type: “Map” Operator
Placement: “Compute” Node
Parallelism: 100
Edge from Map Stage to Reduce Stage: Shuffle, Pull, Disk
Reduce Stage
Type: “Reduce” Operator
Placement: “Compute” Node
Parallelism: 50
19. Compiler to Runtime
Physical DAG:
PhysicalStage: “Map” Tasks × 100
PhysicalStage: “Reduce” Tasks × 50
I/O channels for intermediate data flow between tasks
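A rough sketch of the conversion from the optimized IR DAG (a Map stage with parallelism 100, a Reduce stage with parallelism 50, and a Shuffle, Pull, Disk edge) to a physical DAG is shown below. The `Stage` and `Edge` classes, their field names, and `to_physical` are hypothetical names chosen for this illustration, and the sketch assumes that a shuffle edge connects every upstream task to every downstream task.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One stage of the optimized IR DAG (field names are illustrative)."""
    name: str
    placement: str
    parallelism: int


@dataclass
class Edge:
    """An annotated edge between two stages."""
    src: str
    dst: str
    pattern: str  # e.g. "Shuffle"
    flow: str     # e.g. "Pull"
    store: str    # e.g. "Disk"


def to_physical(stages, edges):
    """Expand each stage into `parallelism` tasks and each edge into I/O channels."""
    tasks = {s.name: [f"{s.name}-{i}" for i in range(s.parallelism)] for s in stages}
    channels = []
    for e in edges:
        if e.pattern != "Shuffle":
            raise ValueError(f"pattern not covered by this sketch: {e.pattern}")
        # Assumption: a shuffle links every upstream task to every downstream task.
        channels += [(u, d, e.flow, e.store) for u in tasks[e.src] for d in tasks[e.dst]]
    return tasks, channels


stages = [Stage("Map", "Compute", 100), Stage("Reduce", "Compute", 50)]
edges = [Edge("Map", "Reduce", "Shuffle", "Pull", "Disk")]
tasks, channels = to_physical(stages, edges)
print(len(tasks["Map"]), len(tasks["Reduce"]), len(channels))
```

Under these assumptions the example yields 100 Map tasks, 50 Reduce tasks, and one channel per Map/Reduce task pair.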
24. Onyx in Action
● Onyx compiler and runtime components
● Onyx job execution: MR, ALS
● Onyx runtime optimization: dynamic skew handling
● Harnessing transient resources with Onyx
Omitted other optimizations due to time constraints!
29. MapReduce
● We will show two executions of MapReduce using different settings:
○ Intermediate data is saved on disk and pulled by the reducers
○ Intermediate data is saved in memory and pushed to the reducers
● To vary the settings, we go through the following passes, all of which are “Annotation” passes:
○ A data store pass
○ A data flow model pass
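The two MapReduce settings above can be sketched as two different pass configurations applied to the same DAG. This is only an illustration: the dictionary layout, the function names `data_store_pass`, `data_flow_model_pass`, and `compile_dag`, and the string values are assumptions for this example, not Onyx's actual interfaces.

```python
def data_store_pass(dag, store):
    """Annotation pass: tag every edge with where intermediate data is kept."""
    return {**dag, "edges": [{**e, "store": store} for e in dag["edges"]]}


def data_flow_model_pass(dag, flow):
    """Annotation pass: tag every edge with how data reaches the consumer."""
    return {**dag, "edges": [{**e, "flow": flow} for e in dag["edges"]]}


def compile_dag(dag, passes):
    """Run the passes in order; annotation passes never change the topology."""
    for apply_pass in passes:
        dag = apply_pass(dag)
    return dag


mr = {"vertices": ["Map", "Reduce"], "edges": [{"src": "Map", "dst": "Reduce"}]}

# Setting 1: intermediate data saved on disk and pulled by the reducers.
disk_pull = compile_dag(mr, [
    lambda d: data_store_pass(d, "Disk"),
    lambda d: data_flow_model_pass(d, "Pull"),
])

# Setting 2: intermediate data saved in memory and pushed to the reducers.
memory_push = compile_dag(mr, [
    lambda d: data_store_pass(d, "Memory"),
    lambda d: data_flow_model_pass(d, "Push"),
])

print(disk_pull["edges"])
print(memory_push["edges"])
```

Both runs share the same vertices and edges; only the edge annotations differ, which is what makes these annotation passes rather than reshaping passes.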
32. Alternating Least Squares Example
● Alternating Least Squares (ALS) is an ML algorithm commonly used in
recommendation systems.
● Most ML algorithms are iterative processes
=> ALS is one of them!
● But how is this expressed in terms of a DAG? (Acyclic!)
33. Alternating Least Squares Example
Naively…
(Read input data) → Iteration 1 → Iteration 2 → ... → Iteration N → (Write output)
Each iteration: a set of operators that executes the ALS algorithm
But what if we want to decide this “N” according to some condition?
(e.g., model convergence in ML)
34. Alternating Least Squares Example
Something special we have for the ALS example: Loops!
Unrolled: (Read input data) → Iteration 1 → Iteration 2 → ... → Iteration N → (Write output)
With a loop: (Read input data) → LoopVertex with termination condition → (Write output)
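The difference between a fixed number of unrolled iterations and a loop with a termination condition can be sketched as below. The `LoopVertex` class here is a hypothetical stand-in written for this example (its constructor arguments and `run` method are assumptions, not the Onyx class), and a Newton step toward the square root of 2 stands in for one ALS iteration.

```python
class LoopVertex:
    """Hypothetical stand-in: a loop body plus a termination condition."""

    def __init__(self, body, should_stop, max_iterations):
        self.body = body                      # one iteration's worth of operators
        self.should_stop = should_stop        # e.g. a model convergence check
        self.max_iterations = max_iterations  # safety bound on "N"

    def run(self, state):
        """Repeat the body until the condition holds or the bound is reached."""
        iterations = 0
        while iterations < self.max_iterations:
            new_state = self.body(state)
            iterations += 1
            done = self.should_stop(state, new_state)
            state = new_state
            if done:
                break
        return state, iterations


# Toy body: one Newton step toward sqrt(2) plays the role of one ALS iteration.
loop = LoopVertex(
    body=lambda x: 0.5 * (x + 2.0 / x),
    should_stop=lambda old, new: abs(new - old) < 1e-9,
    max_iterations=100,
)
result, n = loop.run(1.0)
print(result, n)
```

The DAG stays acyclic because the loop is a single vertex; the number of iterations is decided while running instead of being fixed when the DAG is built.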
36. Dynamic Data Partitioning Example
● What happens if there is a data skew while executing a job?
● How do we detect such a data skew and partition data appropriately?
Components: Onyx Compiler, Onyx Runtime
IR DAG → AnnotationPass(es) and ReshapingPass(es)
37. Dynamic Data Partitioning Example
Optimized IR DAG (Stage → Stage over a Shuffle, Pull, Disk edge) → Physical DAG Conversion
38. Dynamic Data Partitioning Example
Physical DAG Conversion → Physical DAG (PhysicalStage → PhysicalStage)
39. Dynamic Data Partitioning Example
Execute! Physical DAG (PhysicalStage → PhysicalStage)
40. Dynamic Data Partitioning Example
Physical DAG Executing... → Data Size Metric
41. Dynamic Data Partitioning Example
RuntimePass(es) → New DAG
42. Dynamic Data Partitioning Example
Execute! New DAG
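One way a runtime pass could use a data size metric to repartition skewed data is sketched below. This is an assumption-laden illustration, not the Onyx implementation: it assumes the upstream tasks write more hash partitions than there are reducers, that the metric is the total size per partition, and that the pass assigns contiguous partition ranges to reducers. The function name `balance_ranges` and the sample sizes are made up for the example.

```python
def balance_ranges(partition_sizes, num_reducers):
    """Split partitions 0..n-1 into up to `num_reducers` contiguous ranges
    whose total sizes are roughly equal (simple greedy sketch)."""
    total = sum(partition_sizes)
    ranges, start, seen = [], 0, 0
    for i, size in enumerate(partition_sizes):
        seen += size
        # Close the current range once the running total reaches its share.
        share_reached = seen >= total * (len(ranges) + 1) / num_reducers
        if len(ranges) < num_reducers - 1 and share_reached:
            ranges.append((start, i + 1))
            start = i + 1
    ranges.append((start, len(partition_sizes)))
    return ranges


def loads(partition_sizes, ranges):
    """Total data size each reducer would receive for the given ranges."""
    return [sum(partition_sizes[a:b]) for a, b in ranges]


# Data size metric collected during execution: the first two partitions are hot.
sizes = [90, 70, 10, 10, 10, 10, 20, 20]

static_ranges = [(0, 2), (2, 4), (4, 6), (6, 8)]  # even split decided up front
dynamic_ranges = balance_ranges(sizes, num_reducers=4)

print(loads(sizes, static_ranges))
print(loads(sizes, dynamic_ranges))
```

With the static split one reducer receives both hot partitions, while the metric-driven ranges give each hot partition its own reducer. A single partition that is itself too large cannot be split by this approach.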
45. Harnessing Transient Resources with Onyx
Using the techniques introduced in “Pado: A Data Processing Engine for
Harnessing Transient Resources in Datacenters” (EuroSys 2017)
71. Operator Placement Example with the Transient Resource Policy
Multinomial Logistic Regression (MLR): a machine learning application for
classifying inputs, such as tumors as malignant or benign, or ad clicks as
profitable or not.
Gradients are used to update the regression model, which is used for
prediction.
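As a reminder of what "gradients are used to update the regression model" means for MLR, here is a minimal single-example sketch using the standard softmax cross-entropy gradient. It is a generic textbook formulation, not code from the presented system; the function names, the learning rate, and the toy input are assumptions for this example.

```python
import math


def softmax(scores):
    """Turn per-class scores into probabilities (shifted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def predict(weights, x):
    """Class probabilities for input x; `weights` has one row per class."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def gradient(weights, x, label):
    """Gradient of the cross-entropy loss for one labeled example."""
    probs = predict(weights, x)
    return [
        [(p - (1.0 if k == label else 0.0)) * xi for xi in x]
        for k, p in enumerate(probs)
    ]


def update(weights, grads, lr):
    """One gradient descent step on the model."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


weights = [[0.0, 0.0], [0.0, 0.0]]  # 2 classes, 2 features
x, label = [1.0, 2.0], 1
before = predict(weights, x)[label]
weights = update(weights, gradient(weights, x, label), lr=0.1)
after = predict(weights, x)[label]
print(before, after)
```

After one step the predicted probability of the correct class rises above its initial value of 0.5.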
89. Containers
● Amazon EC2 instances (with local SSDs) as containers
● 40 Transient Containers, 5 Reserved Containers
● All containers used for computation
90. Workloads
● Alternating Least Squares: Yahoo! Music User Ratings of Songs with Artist, Album, and Genre Meta
Information, v. 1.0. https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
● Multinomial Logistic Regression: synthetic data
● MapReduce: Page view statistics for Wikimedia projects.
https://dumps.wikimedia.org/other/pagecounts-raw
92. Summary
● Introduces a new data processing system that is flexible
and extensible
○ Compiler that represents various execution policies
○ Runtime that is modular and reconfigurable
● Adapts data processing seamlessly for new deployment
and application requirements
93.
We are working on creating an Apache Incubator
project. We look forward to contributions from many
developers!
We are hiring software developers!
Contact: onyx@spl.snu.ac.kr
Software platform lab site: http://spl.snu.ac.kr
94. Onyx: A Flexible and Extensible Data Processing System
전병곤, 김주연, 송원욱
Software Platform Lab
Joint work with 양영석, 이산하, 서장호, 어정윤, 이계원, 엄태건, 이우연,
이윤성, 정주성, 하현민, 정은지, 김수정, 유경인, 신동진